Dynamic rate adjustment for interaction monitoring

ABSTRACT

Methods and systems for implementing dynamic rate adjustment for interaction monitoring are disclosed. At an entity, the collection of trace information is initiated according to a first sampling rate. The trace information is indicative of interactions between the entity and one or more additional entities. A second sampling rate is determined based at least in part on information external to the entity. The second sampling rate is determined after the collection of the trace information is initiated at the entity according to the first sampling rate. At the entity, the collection of additional trace information is initiated according to the second sampling rate.

This application is a continuation of U.S. patent application Ser. No. 14/297,498, filed Jun. 5, 2014, now U.S. Pat. No. 9,626,275, which is hereby incorporated by reference herein in its entirety.

BACKGROUND

Many companies and other organizations operate computer networks that interconnect numerous computing systems to support their operations, such as with the computing systems being co-located (e.g., as part of a local network) or instead located in multiple distinct geographical locations (e.g., connected via one or more private or public intermediate networks). For example, distributed systems housing significant numbers of interconnected computing systems have become commonplace. Such distributed systems may provide back-end services to web servers that interact with clients. Such distributed systems may also include data centers that are operated by entities to provide computing resources to customers. Some data center operators provide network access, power, and secure installation facilities for hardware owned by various customers, while other data center operators provide “full service” facilities that also include hardware resources made available for use by their customers. However, as the scale and scope of distributed systems have increased, the tasks of provisioning, administering, and managing the resources have become increasingly complicated.

Web servers backed by distributed systems may provide marketplaces that offer goods and/or services for sale to consumers. For instance, consumers may visit a merchant's website to view and/or purchase goods and services offered for sale by the merchant (and/or third party merchants). Some network-based marketplaces (e.g., Internet-based marketplaces) include large electronic catalogues of items offered for sale. For each item offered for sale, such electronic catalogues typically include at least one product detail page (e.g., a web page) that specifies various information about the item, such as a description of the item, one or more pictures of the item, as well as specifications (e.g., weight, dimensions, capabilities) of the item. In various cases, such network-based marketplaces may rely on a service-oriented architecture to implement various business processes and other tasks. The service-oriented architecture may be implemented using a distributed system that includes many different computing resources and many different services that interact with one another, e.g., to produce a product detail page for consumption by a client of a web server.

In order to monitor the performance or behavior of such a distributed system, the flow of data through the system may be traced. The information resulting from the trace may be analyzed, and actions to improve the performance or behavior may be taken in response to the analysis. However, for sufficiently large and complex systems, the computational, network, and/or storage resources required to trace every transaction or to store every trace may exceed an acceptable measure.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example system environment for dynamic rate adjustment for interaction monitoring, according to some embodiments.

FIG. 2 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, according to some embodiments.

FIG. 3 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, including dynamic adjustment of a sampling rate at services, according to some embodiments.

FIG. 4 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, including dynamic adjustment of a rate at an interaction monitoring daemon, according to some embodiments.

FIG. 5 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, including dynamic adjustment of a rate at a storage node, according to some embodiments.

FIG. 6 is a flowchart illustrating a method for dynamically adjusting a sampling rate for a service, according to some embodiments.

FIG. 7 is a flowchart illustrating a method for dynamically adjusting a preservation rate for an interaction monitoring daemon, according to some embodiments.

FIG. 8 is a flowchart illustrating a method for dynamically adjusting a preservation rate for a storage node, according to some embodiments.

FIG. 9 is a flowchart illustrating a method for dynamically adjusting a rate for an entity using a feedback loop, according to some embodiments.

FIG. 10 illustrates an example format of a request identifier, according to some embodiments.

FIG. 11 illustrates an example transaction flow for fulfilling a root request, according to some embodiments.

FIG. 12 illustrates one example of a service of a service-oriented system, according to some embodiments.

FIG. 13 illustrates an example data flow diagram for the collection of log data and generation of a call graph, according to some embodiments.

FIG. 14 illustrates an example visual representation of a call graph and request identifiers from which such call graph is generated, according to some embodiments.

FIG. 15 illustrates an example system configuration for tracking service requests, according to some embodiments.

FIG. 16 illustrates an example of a computing device that may be used in some embodiments.

While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning “having the potential to”), rather than the mandatory sense (i.e., meaning “must”). Similarly, the words “include,” “including,” and “includes” mean “including, but not limited to.”

DETAILED DESCRIPTION OF EMBODIMENTS

Various embodiments of methods and systems for dynamic rate adjustment for interaction monitoring are described. In a distributed system comprising various services and/or other components, interactions between services (e.g., call paths) may be monitored to generate trace information. Using the systems and methods described herein, rates for initiating traces and/or for preserving trace information may be dynamically adjusted. A rate may be adjusted based on information external to the service or component that applies the rate to initiate traces or preserve trace information. For example, various aspects of the performance of the distributed system may be monitored, and the sampling rates and/or preservation rates may be adjusted accordingly. By adjusting rates in this manner, the amount of trace information that is generated and/or preserved may be adjusted based on conditions in the distributed system. Accordingly, the costs of performing the interaction monitoring and storing the resulting data may be considered in the optimization of the distributed system.

FIG. 1 illustrates an example system environment for dynamic rate adjustment for interaction monitoring, according to some embodiments. The example system environment may include an interaction monitoring system 100. The interaction monitoring system 100 may include a plurality of components for monitoring interactions between entities such as services (including components of services) and efficiently storing trace information based on the monitored interactions. For example, the interaction monitoring system 100 may include interaction monitoring functionality 110, trace information storage functionality 120, and trace information analysis functionality 130.

The interaction monitoring system 100 may comprise one or more computing devices, any of which may be implemented by the example computing device 3000 illustrated in FIG. 16. In various embodiments, the functionality of the different services, components, and/or modules of the interaction monitoring system 100 (e.g., interaction monitoring functionality 110, trace information storage functionality 120, and trace information analysis functionality 130) may be provided by the same computing device or by different computing devices. If any of the various components are implemented using different computing devices, then the respective computing devices may be communicatively coupled, e.g., via a network. Each of the interaction monitoring functionality 110, trace information storage functionality 120, and trace information analysis functionality 130 may represent any combination of software and hardware usable to perform their respective functions, as discussed below.

The interaction monitoring functionality 110 may monitor or track interactions between entities such as services or components of services in a distributed, service-oriented system, such as a system structured according to a service-oriented architecture (SOA). A service-oriented architecture may include multiple services configured to communicate with each other (e.g., through message passing) to carry out various tasks, such as business process functions. As shown in the example of FIG. 1, the interaction monitoring functionality 110 may monitor interactions between or among services 150A and 150B through 150N. Although three services 150A-150N are shown for purposes of illustration and example, it is contemplated that any suitable number of services (including instances of the same service and/or instances of different services) may be used with the interaction monitoring system 100. The services may be distributed across multiple computing instances and/or multiple subsystems that are connected, e.g., via one or more networks. In some embodiments, such services may be loosely coupled in order to minimize (or in some cases eliminate) interdependencies among services. This modularity may enable services to be reused in order to build various applications through a process referred to as orchestration. A service may include one or more components that may also participate in the service-oriented architecture, e.g., by passing messages to other services or to other components within the same service.

Service-oriented systems may be configured to process requests from various internal or external systems, such as client computer systems or computer systems consuming networked-based services (e.g., web services). For instance, an end-user operating a web browser on a client computer system may submit a request for data (e.g., data associated with a product detail page, a shopping cart application, a checkout process, search queries, etc.). In another example, a computer system may submit a request for a web service (e.g., a data storage service, a data query, etc.). In general, services may be configured to perform any of a variety of business processes. The service interactions may include requests (e.g., for services to be performed), responses to requests, and other suitable events.

The services and components described herein may include but are not limited to one or more of network-based services (e.g., a web service), applications, functions, objects, methods (e.g., object-oriented methods), subroutines, or any other set of computer-executable instructions. In various embodiments, such services and components may communicate through any of a variety of communication protocols, including but not limited to the Simple Object Access Protocol (SOAP). In various embodiments, messages passed between services and components may include but are not limited to Extensible Markup Language (XML) messages or messages of any other markup language or format. In various embodiments, descriptions of operations offered by one or more of the services and components may include Web Service Description Language (WSDL) documents, which may in some cases be provided by a service broker accessible to the services and components. References to services herein may include components within services.

In one embodiment, the interaction monitoring functionality 110 may monitor interactions between services in any suitable environment, such as a production environment and/or a test environment. The production environment may be a “real-world” environment in which a set of production services are invoked, either directly or indirectly, by interactions with a real-world client, consumer, or customer, e.g., of an online merchant or provider of web-based services. In one embodiment, the test environment may be an environment in which a set of test services can be invoked in order to test their functionality. The test environment may be isolated from real-world clients, consumers, or customers of an online merchant or provider of web-based services. In one embodiment, the test environment may be implemented by configuring suitable elements of computing hardware and software in a manner designed to mimic the functionality of the production environment. In one embodiment, the test environment may temporarily borrow resources from the production environment. In one embodiment, the test environment may be configured to shadow the production environment, such that individual test services represent shadow instances of corresponding production services. When the production environment is run in shadow mode, copies of requests generated by production services may be forwarded to shadow instances in the test environment to execute the same transactions.

To monitor the service interactions, lightweight instrumentation may be added to the services. The instrumentation (e.g., a reporting agent associated with each service) may collect and report data associated with each inbound request, outbound request, or other service interaction (e.g., a timer-based interaction) processed by a service. Further aspects of the service instrumentation, interaction monitoring functionality 110, and trace information storage functionality 120 are discussed below with respect to FIGS. 10-15.

The trace information storage functionality 120 may collect, aggregate, and/or store trace information generated using the interaction monitoring functionality 110. In one embodiment, one or more interaction monitoring daemons 160 may periodically receive elements of trace information from one or more of the services 150A-150N. In some embodiments, the one or more interaction monitoring daemons 160 may determine whether to discard or preserve individual elements of the trace information. When trace information is to be preserved, the one or more interaction monitoring daemons 160 may pass the trace information to one or more storage nodes 170. In one embodiment, the one or more storage nodes 170 may cause the trace information to be stored, e.g., in persistent storage of any suitable configuration. In some embodiments, the one or more interaction monitoring daemons 160 may store individual elements of the trace information in persistent local storage, e.g., storage maintained by the one or more interaction monitoring daemons 160. In some embodiments, the one or more interaction monitoring daemons 160 may generate a summary of the trace information and store the summary or send the summary to one or more other components. In some embodiments, the one or more interaction monitoring daemons 160 may send further information regarding the trace information to one or more other components. Aspects of the trace information storage functionality 120 may be implemented using the log repository 2410 illustrated in FIGS. 13 and 15.

The interaction monitoring system 100 may generate one or more traces based on the collected service interactions. In one embodiment, one or more suitable elements of the trace information analysis functionality 130 may analyze the trace information generated by the interaction monitoring functionality 110 in order to generate traces. Each of the traces may collect data indicative of service interactions involved in satisfying a particular initial request. In one embodiment, a particular trace may include data indicative of a route taken in satisfying a service request and/or a hierarchy of call pathways between services. The route may correspond to a set of call pathways between services. The call pathways may represent inbound service requests and outbound service requests relative to a particular service. To process a given received request, the system described herein may invoke one or more of the types of services described above. As used herein, an initial request may be referred to as the “root request.” In various embodiments, the root request may but need not originate from a computer system outside of the service-oriented system described herein. In many embodiments, a root request may be processed by an initial service, which may then call one or more other services. Additionally, each of those services may also call one or more other services, and so on until the root request is completely fulfilled. Accordingly, the particular services called to fulfill a request may be represented as a call graph that specifies, for each particular service of multiple services called to fulfill the same root request, the service that called the particular service and any services called by the particular service.

In one embodiment, the trace information analysis functionality 130 may include a call graph generation functionality 180 that generates one or more call graphs based on trace information generated using the interaction monitoring functionality 110. In some embodiments, the trace information used to generate one or more call graphs may be retrieved from one or more storage nodes 170 or received from one or more daemons 160. A call graph based on a trace may be a hierarchical data structure that includes nodes representing the services and edges representing the interactions. In some cases, a call graph may be a deep and broad tree with multiple branches each representing a series of related service calls. Suitable elements of the interaction monitoring system 100, including the trace information analysis functionality 130, may use any suitable data and metadata to build the traces and/or call graphs, such as request identifiers and metadata associated with services and their interactions. The request identifiers and metadata are discussed below with respect to FIGS. 10-15. Aspects of the call graph generation functionality 180 may be implemented using the call graph generation logic 2420 illustrated in FIGS. 13 and 15.
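
As a rough illustration of how call graph generation logic might assemble call graphs from collected trace records, consider the following Python sketch. The record fields (origin_id, interaction_id, parent_interaction_id, service) are assumptions made for illustration and do not correspond to any particular format described herein.

```python
from collections import defaultdict

def build_call_graphs(trace_records):
    """Group trace records by origin ID and build one call graph per root request.

    Each record is assumed to carry: origin_id, interaction_id,
    parent_interaction_id (None for the root request), and service.
    """
    by_origin = defaultdict(list)
    for record in trace_records:
        by_origin[record["origin_id"]].append(record)

    graphs = {}
    for origin_id, records in by_origin.items():
        # Map each parent interaction to the services it called (the edges).
        edges = defaultdict(list)
        for record in records:
            edges[record["parent_interaction_id"]].append(
                (record["interaction_id"], record["service"])
            )
        graphs[origin_id] = edges
    return graphs
```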

For clarity of description, various terms may be useful for describing elements of a trace or call graph. Note that the following terminology may only be applicable to services and requests of a given trace or call graph. In other words, the following terminology may only be applicable for services and requests associated with the same root request. From the perspective of a particular service, any service that calls the particular service may be referred to as a “parent service.” Furthermore, from the perspective of a particular service, any service that the particular service calls may be referred to as a “child service.” In a similar fashion, from the perspective of a particular request, any request from which the particular request stems may be referred to as a “parent request.” Furthermore, from the perspective of a particular request, any request stemming from the particular request may be referred to as a “child request.” Additionally, as used herein the phrases “request,” “call,” “service request” and “service call” may be used interchangeably. Note that this terminology refers to the nature of the propagation of a particular request throughout the present system and is not intended to limit the physical configuration of the services. As may sometimes be the case with service-oriented architectures employing modularity, each service may in some embodiments be independent of other services in the service-oriented system (e.g., the source code of services or their underlying components may be configured such that interdependencies among source and/or machine code are not present).

The generation of a particular trace may end, and the trace may be finalized, based on any suitable determination. In one embodiment, the trace may be finalized after a sufficient period of time has elapsed with no further service interactions made for any relevant service. In one embodiment, heuristics or other suitable rule sets may be used to determine a timeout for a lack of activity to satisfy a particular root request. The timeout may vary based on the nature of the root request. For example, a root request to generate a web page using a hierarchy of services may be expected to be completed within seconds; accordingly, the trace may be finalized within minutes. As another example, a root request to fulfill and ship a product order may be expected to be completed within days or weeks; accordingly, the trace may be finalized within weeks or even months.
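
A timeout-based finalization heuristic along the lines just described might be sketched as follows. The request types and the timeout values are illustrative assumptions only.

```python
import time

# Hypothetical per-request-type inactivity timeouts, in seconds.
FINALIZATION_TIMEOUTS = {
    "render_page": 5 * 60,        # page renders finish in seconds; finalize within minutes
    "fulfill_order": 30 * 86400,  # order fulfillment may take weeks; finalize after ~a month
}

def is_trace_finalized(trace, now=None):
    """Finalize a trace once no interaction has been observed for the timeout
    associated with its root request type (default timeout if type is unknown)."""
    now = now if now is not None else time.time()
    timeout = FINALIZATION_TIMEOUTS.get(trace["root_request_type"], 10 * 60)
    return (now - trace["last_interaction_time"]) >= timeout
```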

In some embodiments, the trace information storage functionality 120 may store data corresponding to the traces in an efficient manner, such as by filtering and discarding duplicative traces. Using deduplication techniques, each of the traces generated based on the monitored service interactions may be compared to a set of stored traces, and each of the stored traces may represent a unique trace (e.g., a unique combination of services used in satisfying a root request), an example of a type of trace, or a trace that otherwise satisfies one or more of a set of predefined conditions. If one of the traces does not match any of the stored traces, or if it is sufficiently dissimilar to all of the stored traces, then the trace may be added to the set of stored traces. However, if one of the traces matches or is sufficiently similar to one of the stored traces, the trace may be discarded. In one embodiment, relevant statistics may be updated for the stored trace matching the discarded trace. The statistics may include, for example, a count of hits on the particular trace, an average latency, a percentile latency, and/or any other suitable statistics. For example, the hit count for the stored trace may be incremented for every matching trace that is discarded.
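
The deduplication behavior described above might be sketched as follows. The notion of a trace "signature" and the particular statistics kept are simplifying assumptions for this sketch.

```python
def record_trace(new_trace, stored_traces, signature):
    """Keep one representative per unique trace shape; discard duplicates
    but update statistics for the matching stored trace.

    `signature` reduces a trace to a comparable key, e.g. the ordered set
    of call pathways (an assumption made for this sketch).
    """
    key = signature(new_trace)
    if key in stored_traces:
        stats = stored_traces[key]["stats"]
        stats["hits"] += 1                         # count of matching traces discarded
        stats["total_latency"] += new_trace["latency"]
        return False                               # duplicate: discarded
    stored_traces[key] = {
        "trace": new_trace,
        "stats": {"hits": 1, "total_latency": new_trace["latency"]},
    }
    return True                                    # unique: added to the stored set
```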

In one embodiment, all or nearly all of the service interactions may be monitored using the techniques described herein. In one embodiment, only a percentage of the service interactions may be monitored, and/or traces may be generated for only a percentage of traceable service requests. Any suitable technique may be used to identify which of the service interactions and/or root requests to trace. In one embodiment, probabilistic sampling techniques may be used to initiate traces for a certain percentage (e.g., 1%) of all traceable root requests. As will be described in greater detail below, sampling rates for services and/or preservation rates for daemons and/or storage nodes may be dynamically adjusted based on suitable conditions throughout the distributed system.
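
Probabilistic initiation of traces at a given sampling rate might be as simple as the following sketch.

```python
import random

def should_initiate_trace(sampling_rate=0.01):
    """Probabilistically decide whether to initiate a trace for a traceable
    root request. With sampling_rate=0.01, roughly 1% of root requests
    are traced."""
    return random.random() < sampling_rate
```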

FIG. 2 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, according to some embodiments. As discussed above with respect to FIG. 1, the interaction monitoring system 100 may include a plurality of components for monitoring interactions between entities such as services and efficiently storing trace information based on the monitored interactions. For example, the interaction monitoring system 100 may include a plurality of services such as services 150A and 150B through 150N, one or more interaction monitoring daemons 160, and one or more storage nodes 170.

Additionally, the interaction monitoring system 100 may include at least one interaction monitoring server 200. Using the interaction monitoring server 200, conditions in a distributed system may be monitored, and sampling rates and/or preservation rates for trace information may be dynamically adjusted in a feedback loop. In various embodiments, the targets of the dynamic rate adjustment may include one or more services 150A-150N, one or more daemons 160, and/or one or more storage nodes 170. The interaction monitoring server 200 may comprise one or more computing devices, any of which may be implemented by the example computing device 3000 illustrated in FIG. 16. In various embodiments, the functionality of the different services, components, and/or modules of the interaction monitoring server 200 (e.g., system monitoring functionality 210, rate determination functionality 220, and policy implementation functionality 230) may be provided by the same computing device or by different computing devices. If any of the various components are implemented using different computing devices, then the respective computing devices may be communicatively coupled, e.g., via a network. Each of the system monitoring functionality 210, rate determination functionality 220, and policy implementation functionality 230 may represent any combination of software and hardware usable to perform their respective functions, as discussed below. In some embodiments, all or part of the functionality of the interaction monitoring server 200 may be distributed among other entities, potentially including one or more of the services 150A-150N, the interaction monitoring daemon(s) 160, and/or the storage node(s) 170.

In one embodiment, the system monitoring functionality 210 may monitor any suitable conditions and/or attributes throughout a system such as a distributed, service-oriented system. Accordingly, the system monitoring functionality 210 may monitor any suitable conditions and/or attributes in one or more of the services 150A-150N, one or more of the daemons 160, and/or one or more of the storage nodes 170. Additionally, the system monitoring functionality 210 may monitor any other suitable components or attributes of the service-oriented system, such as other data sources 240. For example, the system monitoring functionality 210 may monitor any suitable performance metrics, such as metrics relating to throughput, latency, processor usage, memory usage, storage usage, network usage, cache efficiency, component I/O performance, etc. The performance metrics may relate to individual components, such as the network traffic at a particular service or host, or to components in the aggregate, such as the network traffic at a set of services or hosts. As another example, the system monitoring functionality 210 may monitor or otherwise have knowledge of the cost of various components, e.g., the cost of particular storage resources, the cost of particular types of service hosts, etc. The system monitoring functionality 210 may acquire data using any suitable techniques, including push and/or pull techniques.

In some embodiments, the system monitoring functionality 210 may also acquire information from one or more external data sources 250, e.g., sources outside the distributed system that includes the services 150A-150N, daemon(s) 160, storage node(s) 170, and other data source(s) 240. For example, the external data source(s) 250 may include newsfeeds for current events, financial events, weather events, internet conditions, etc. In general, the external data source(s) 250 may include any data sources providing information that may prove useful in dynamically adjusting the rates for initiating traces and/or preserving trace information in the distributed system.

In one embodiment, the rate determination functionality 220 may determine sampling rates for initiating traces at individual entities such as services. The sampling rates may be determined based on the system conditions and/or attributes monitored using the system monitoring functionality 210. Similarly, the interaction monitoring server 200 may determine rates for preserving trace information at daemons and/or storage nodes. In determining rates, the rate determination functionality 220 may consider information from a variety of sources, including sources both inside and outside the system that includes the target of the rate change. In general, the rate determination functionality 220 may base the rate determination on information external to the entity whose rate change is determined. However, the rate determination functionality 220 may also consider information provided by the entity whose rate change is determined.

In various embodiments, numerous types of externalities may inform the dynamic adjustment of rates. In one embodiment, the initial sampling rate for a service may be determined based on a variety of factors (e.g., as expressed as coefficients in a formula) when the service is deployed. For example, a sampling rate for a service may be based (in part) on a fleet size of the hosts that implement the service. As the fleet size changes, the size of the fleet may be used to dynamically adjust the sampling rate for a particular service within the fleet. The size of the fleet represents information external to any particular service within the fleet. As another example, the information external to a service may include a determination that all upstream callers are configured to initiate traces; in this case, the sampling initiation may be disabled for the downstream service. Similarly, if the system monitoring functionality 210 determines that some upstream callers are not configured to initiate traces, then the sampling initiation may be enabled again for the downstream service with a suitable sample rate. Another source of external information may include a determination that many redundant traces (e.g., traces involving the same or similar call paths) are being generated; in this case, the sampling rate may be reduced for one or more relevant services. As another example, as total network traffic increases, the number of different use cases may not scale linearly, so the sampling rates for services may be reduced. Similarly, sampling rates may be increased as total traffic decreases. Additionally, global performance metrics may be monitored for the distributed system so that sampling rates may be reduced if critical thresholds have been crossed or are anticipated to be crossed. In general, the external information may be gathered to indicate that the amount of trace information is excessive (e.g., if too much data or too similar data is being observed) or to indicate that the amount of trace information is insufficient (e.g., if debugging is desired for an observed performance problem). Machine learning techniques may be used to identify such external information, e.g., to anticipate changes in network traffic or service load.
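
For illustration only, the following sketch combines several of the externalities just described (fleet size, traffic volume, redundancy, and upstream tracing) into a single rate formula. The specific factors, coefficients, and clamping are assumptions made for this sketch, not a formula prescribed by the embodiments.

```python
def determine_sampling_rate(base_rate, fleet_size, traffic_ratio,
                            redundancy_factor, upstream_traces_everything):
    """Illustrative rate formula (assumed, not authoritative).

    - fleet_size: more hosts implementing the service -> each host samples less.
    - traffic_ratio: current traffic relative to a baseline; use cases do not
      scale linearly with traffic, so higher traffic -> lower rate.
    - redundancy_factor: fraction of recent traces judged redundant.
    - upstream_traces_everything: if all upstream callers already initiate
      traces, downstream initiation can be disabled entirely.
    """
    if upstream_traces_everything:
        return 0.0
    rate = base_rate / max(fleet_size, 1)
    rate /= max(traffic_ratio, 1.0)
    rate *= (1.0 - redundancy_factor)
    return min(max(rate, 0.0), 1.0)   # clamp to a valid probability
```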

In one embodiment, the policy implementation functionality 230 may implement the new sampling rates, preservation rates, and/or related policies for monitoring service interactions. In one embodiment, the new rates and/or policies may be assigned to components without needing to redeploy or restart the affected components, e.g., while the components are deployed and operational. The policy implementation functionality 230 may use any suitable technique to promulgate rates and/or policies to particular services, daemons, and/or storage nodes. For example, sampling policies (including sampling rates) may be distributed from the server 200 to various clients (including services 150A-150N) using distribution/update techniques with client-side reactions. As additional examples, sampling rates may be changed using client-side responses to metric queries or peer-based communication between members of a host class.

FIG. 3 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, including dynamic adjustment of a sampling rate at services, according to some embodiments. The sampling rate at a particular service may indicate a rate at which traces should be initiated at that service. For example, if a particular service receives an incoming service request (or other interaction), and if tracing has not already been enabled in the request, the service may determine whether to initiate a trace for the request based on the sampling rate assigned to the service. The sampling rate at a service may indicate a percentage of all interactions to be traced, a time-based threshold (e.g., a policy to generate a particular number of traces per unit of time), or any other suitable policy for determining when to initiate a trace. As discussed above, the interaction monitoring server 200 may monitor conditions and/or attributes in a distributed system and determine adjusted sampling rates and/or related policies. In one embodiment, the interaction monitoring server 200 may determine and assign individual rates for individual services or for categories of services. For example, as shown in FIG. 3, the interaction monitoring server 200 may determine a sampling rate 151A for service 150A, a sampling rate 151B for service 150B, and a sampling rate 151N for service 150N. In various embodiments, any of the sampling rates 151A-151N may differ from other sampling rates or be identical to other sampling rates, as appropriate.
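
The two kinds of sampling policy mentioned above (a percentage of interactions versus a number of traces per unit of time) might be sketched as follows. The class names and the window-based accounting are illustrative assumptions.

```python
import random
import time

class PercentageSamplingPolicy:
    """Trace a fixed fraction of interactions that are not already traced."""
    def __init__(self, rate):
        self.rate = rate                    # e.g. 0.01 for 1%

    def should_trace(self, already_traced):
        if already_traced:
            return True                     # honor the upstream tracing decision
        return random.random() < self.rate  # otherwise, initiate probabilistically


class TimeBasedSamplingPolicy:
    """Initiate at most `max_traces` new traces per `window` seconds."""
    def __init__(self, max_traces, window=1.0):
        self.max_traces = max_traces
        self.window = window
        self.window_start = time.monotonic()
        self.count = 0

    def should_trace(self, already_traced):
        if already_traced:
            return True
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start, self.count = now, 0   # start a new window
        if self.count < self.max_traces:
            self.count += 1
            return True
        return False
```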

In one embodiment, one or more of the entities subject to rate changes may perform self-adjustment of rates. For example, one of the services 150A-150N may implement the system monitoring 210 and rate determination 220 functionalities to modify its own sampling rate. The rates determined using the self-adjustment may be limited to upper and/or lower bounds or other policies as permitted by the interaction monitoring server 200. In one embodiment, one or more of the entities may trigger the rate determination functionality 220 by sending an appropriate message to the interaction monitoring server 200. For example, one of the services 150A-150N may trigger the rate determination functionality 220 upon detecting a change in service usage or other local conditions. Additionally, a sampling rate or preservation rate may be determined and assigned to an entity that did not previously enable any operation making use of a sampling rate or preservation rate.

FIG. 4 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, including dynamic adjustment of a rate at an interaction monitoring daemon, according to some embodiments. Using similar techniques as the sampling rate determination discussed above with respect to FIG. 3, the rate determination functionality 220 may determine one or more storage rates and/or related policies for components in the interaction monitoring system 100. The storage rates may be determined based on the system monitoring 210. Storage rates may also be referred to herein as preservation rates or retention rates. In one embodiment, storage rates may be applied by components of the trace information storage functionality 120 (e.g., one or more daemons 160 and/or storage nodes 170) to preserve or discard trace information when all or substantially all of the interactions among services are traced, resulting in a very large amount of trace information to be processed by the trace information storage functionality 120.

The storage rate at a particular component may indicate a rate at which trace information should be stored, preserved, or maintained at that component. For example, if a particular daemon receives trace information from one or more services, the daemon may determine whether to preserve or discard the trace information based on the storage rate assigned to the daemon. The storage rate at a service may indicate a percentage of all trace data to be preserved, a time-based threshold (e.g., a policy to preserve a particular amount of trace data per unit of time), or any other suitable policy for determining when to store or preserve trace information. As discussed above, the interaction monitoring server 200 may monitor conditions and/or attributes in a distributed system and determine adjusted storage rates and/or related policies. In one embodiment, the interaction monitoring server 200 may determine and assign individual rates for individual components of the trace information storage functionality 120. For example, as shown in FIG. 4, the interaction monitoring server 200 may determine one or more storage rates 161 for the daemon(s) 160. In one embodiment, the daemon(s) may apply different storage rates or related policies to trace information associated with different services or hosts, e.g., by masking the relevant bits in each trace ID to identify the originating host name or service name in the trace information. In one embodiment, the same storage rates may be assigned to multiple daemons so that policies for trace information preservation may be carried out in a uniform or consistent manner across the distributed system.
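
As a rough illustration of per-service storage rates at a daemon, the following sketch masks bits of a numeric trace ID to look up a rate. The bit layout, mask, and shift are hypothetical and chosen only for illustration.

```python
import random

def should_preserve(trace_id, per_service_rates, default_rate,
                    service_bits_mask=0xFF, service_bits_shift=56):
    """Decide whether a daemon preserves or discards an element of trace
    information. Here the high byte of the 64-bit trace ID is assumed to
    encode the originating service; this layout is an illustrative assumption.
    """
    service_key = (trace_id >> service_bits_shift) & service_bits_mask
    rate = per_service_rates.get(service_key, default_rate)
    return random.random() < rate
```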

FIG. 5 illustrates further aspects of an example system environment for dynamic rate adjustment for interaction monitoring, including dynamic adjustment of a rate at a storage node, according to some embodiments. As discussed above, the interaction monitoring server 200 may monitor conditions and/or attributes in a distributed system and determine adjusted storage rates. In one embodiment, the interaction monitoring server 200 may determine and assign individual rates for individual components of the trace information storage functionality 120. For example, as shown in FIG. 5, the interaction monitoring server 200 may determine one or more storage rates 171 for the storage node 170. The storage rate at a particular storage node may indicate a rate at which trace information should be stored, preserved, or maintained at that storage node. For example, when a particular storage node receives trace information, the storage node may determine whether to store or discard the trace information based on the storage rate assigned to the storage node. In one embodiment, the storage node(s) may apply different storage rates or related policies to trace information associated with different services or hosts. In one embodiment, the same storage rates may be assigned to multiple storage nodes so that policies for trace information preservation may be carried out in a consistent manner across the distributed system.

Although FIGS. 3-5 illustrate the potential targets of rate changes as services 150A-150N, daemon(s) 160, and storage node(s) 170, it is contemplated that similar techniques may be used to adjust sampling rates and/or information retention rates for other types of entities. For example, the interaction monitoring system 100 may be used to monitor and change sampling rates and/or information retention rates for objects within elements of executable software, hardware components within a system, software elements communicating with inter-process communication within a single host, etc. In one embodiment, the interaction monitoring system 100 may be used to monitor and change sampling rates and/or information retention rates for nodes in a social graph.

FIG. 6 is a flowchart illustrating a method for dynamically adjusting a sampling rate for a service, according to some embodiments. As shown in 605, trace information may be collected at a service according to a first sampling rate. The trace information may describe interactions between the service and one or more additional services in a service-oriented system. The term “service” as used in FIG. 6 may refer to any service or component within a service. In one embodiment, initiation of a trace may be performed based on the first sampling rate, such that only a percentage (as indicated by the sampling rate) or other subset of interactions are traced.

As shown in 610, a second sampling rate may be determined based (at least in part) on information external to the service. The second rate may be higher or lower than the first rate. For example, the second sampling rate may be determined by an interaction monitoring server based on the monitoring of suitable conditions and/or attributes throughout the service-oriented system. In one embodiment, the collective performance of the service and the additional services, as well as other components, may be monitored to generate the information external to the service. In one embodiment, the individual performance of various additional services and/or other components may be monitored to generate the information external to the service. Accordingly, at least a portion of the information external to the service may represent global conditions in the service-oriented system. In some embodiments, externalities used to inform the dynamic adjustment of the sampling rate for a service may include, for example: the volume of traffic in all or part of the service-oriented system, the size of a fleet of hosts implementing the service, a determination of redundancy in the trace information, an anticipation of higher or lower traffic volume at a future point in time, a determination that a performance threshold has been crossed, and/or a determination that a performance threshold will be crossed at a future point in time.

As shown in 615, the second sampling rate may be assigned to the service. In one embodiment, the interaction monitoring server may assign the sampling rate to the service using any suitable technique for policy implementation. As shown in 620, additional trace data may be collected at the service according to the second sampling rate. In one embodiment, initiation of a trace may be performed based on the second sampling rate, such that only a percentage (as indicated by the sampling rate) or other subset of interactions are traced. In this manner, a feedback loop may be implemented to dynamically adjust the amount of traces being initiated at a particular service. To implement a feedback loop for rate changes, any of the operations shown in FIG. 6 may be performed repeatedly and/or continuously. For example, the operations shown in 610, 615, and 620 may be repeated any suitable number of times to adjust the sampling rate based on recent conditions, current conditions, and/or anticipated conditions in the system.

FIG. 7 is a flowchart illustrating a method for dynamically adjusting a preservation rate for an interaction monitoring daemon, according to some embodiments. As shown in 705, trace information may be collected, where the trace information describes interactions between services in a service-oriented system. In one embodiment, all or nearly all interactions may be traced. As shown in 710, the trace information may be sent from the services to an interaction monitoring daemon.

As shown in 715, the trace information may be stored using the daemon according to a first rate. Storing the trace information may include passing the trace information to a storage node, storing the trace information in local persistent storage, or otherwise preserving the trace information rather than discarding the trace information. By applying the first rate, only a percentage (as indicated by the first rate) or other subset of trace data may be stored.

As shown in 720, a second rate may be determined based (at least in part) on information external to the daemon. The second rate may be higher or lower than the first rate. For example, the second rate may be determined by an interaction monitoring server based on the monitoring of suitable conditions and/or attributes throughout the service-oriented system. In one embodiment, the collective performance of the multiple services or instances of a service, as well as other components, may be monitored to generate the information external to the daemon. In one embodiment, the individual performance of various services and/or other components may be monitored to generate the information external to the daemon. Accordingly, at least a portion of the information external to the daemon may represent global conditions in the service-oriented system. In some embodiments, externalities used to inform the dynamic adjustment of the storage rate for a daemon may include, for example: the volume of traffic in all or part of the service-oriented system, the size of a fleet of hosts, a determination of redundancy in the trace information, an anticipation of higher or lower traffic volume at a future point in time, a determination that a performance threshold has been crossed, and/or a determination that a performance threshold will be crossed at a future point in time.

As shown in 725, the second rate may be assigned to the daemon. In one embodiment, the interaction monitoring server may assign the rate to the daemon using any suitable technique for policy implementation. As shown in 730, additional trace information may be stored using the daemon according to the second rate. In this manner, a feedback loop may be implemented to dynamically adjust the amount of trace information being stored or otherwise preserved using a particular daemon. To implement the feedback loop for rate changes, any of the operations shown in FIG. 7 may be performed repeatedly and/or continuously. For example, the operations shown in 720, 725, and 730 may be repeated any suitable number of times to adjust the preservation rate based on recent conditions, current conditions, and/or anticipated conditions in the system.

FIG. 8 is a flowchart illustrating a method for dynamically adjusting a preservation rate for a storage node, according to some embodiments. As shown in 805, trace information may be collected, where the trace information describes interactions between services in a service-oriented system. In one embodiment, all or nearly all interactions may be traced. As shown in 810, the trace information may be sent from the services to a storage node, potentially using one or more interaction monitoring daemons or other components as intermediaries.

As shown in 815, the trace information may be stored using the storage node according to a first rate. Storing the trace information may include placing the trace information in local storage or otherwise preserving the trace information rather than discarding the trace information. By applying the first rate, only a percentage (as indicated by the first rate) or other subset of trace data may be stored.

As shown in 820, a second rate may be determined based (at least in part) on information external to the storage node. The second rate may be higher or lower than the first rate. For example, the second rate may be determined by an interaction monitoring server based on the monitoring of suitable conditions and/or attributes throughout the service-oriented system. In one embodiment, the collective performance of the multiple services or instances of a service, as well as other components, may be monitored to generate the information external to the storage node. In one embodiment, the individual performance of various services and/or other components may be monitored to generate the information external to the storage node. Accordingly, at least a portion of the information external to the storage node may represent global conditions in the service-oriented system. In some embodiments, externalities used to inform the dynamic adjustment of the storage rate for a storage node may include, for example: the volume of traffic in all or part of the service-oriented system, the size of a fleet of hosts, a determination of redundancy in the trace information, an anticipation of higher or lower traffic volume at a future point in time, a determination that a performance threshold has been crossed, and/or a determination that a performance threshold will be crossed at a future point in time.

As shown in 825, the second rate may be assigned to the storage node. In one embodiment, the interaction monitoring server may assign the rate to the storage node using any suitable technique for policy implementation. As shown in 830, additional trace information may be stored using the storage node according to the second rate. In this manner, a feedback loop may be implemented to dynamically adjust the amount of trace information being stored or otherwise preserved using a particular storage node. To implement the feedback loop for rate changes, any of the operations shown in FIG. 8 may be performed repeatedly and/or continuously. For example, the operations shown in 820, 825, and 830 may be repeated any suitable number of times to adjust the preservation rate based on recent conditions, current conditions, and/or anticipated conditions in the system.

FIG. 9 is a flowchart illustrating a method for dynamically adjusting a rate for an entity using a feedback loop, according to some embodiments. As discussed above, the feedback loop for dynamic rate adjustment may be applied to any suitable entity that participates in a tracing environment, including services or components in a distributed, service-oriented system as well as other entities such as software objects, nodes in a social graph, etc. As shown in 905, trace information may be collected or preserved at an entity according to a first rate. The trace information may describe interactions between the entity and one or more additional entities in a system.

As shown in 910, the system may be monitored for information, including information external to the entity. As discussed above, any suitable information may be gathered, including performance metrics for individual entities within the system, aggregate performance metrics for multiple entities within the system, and/or information from outside the system. As shown in 915, a new rate may be determined based (at least in part) on information external to the entity. The new rate may be higher or lower than the first rate. For example, the new rate may be determined by an interaction monitoring server based on the monitoring of suitable conditions and/or attributes throughout the system.

As shown in 920, the new rate may be assigned to the entity. In one embodiment, the interaction monitoring server may assign the new rate to the entity using any suitable technique for policy implementation. As shown in 925, trace information may be collected or preserved at the entity according to the new rate. The method may then proceed with the monitoring operation shown in 910. Accordingly, the operations shown in 910-925 may be performed any suitable number of times to continuously and dynamically adjust the rates for initiating traces and/or preserving trace information at an entity.
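
From the perspective of an interaction monitoring server, the feedback loop of FIG. 9 might be sketched roughly as follows. The monitor, determine_rate, and assign_rate callables and the entity objects are placeholders for deployment-specific logic, assumed here only to make the loop concrete.

```python
import time

def rate_adjustment_loop(entities, monitor, determine_rate, assign_rate,
                         interval_seconds=60.0):
    """Repeatedly monitor conditions, determine new rates based on information
    external to each entity, and assign any changed rates (a sketch).
    """
    while True:
        conditions = monitor()                      # global and per-entity metrics
        for entity in entities:
            new_rate = determine_rate(entity, conditions)
            if new_rate != entity.current_rate:
                assign_rate(entity, new_rate)       # applied without redeploy/restart
                entity.current_rate = new_rate
        time.sleep(interval_seconds)                # then loop back to monitoring
```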

Tracking Service Requests

As described above, a given parent request may result in multiple child service calls to other services. In various embodiments of the system and method for tracking service requests, request identifiers embedded within such service calls (or located elsewhere) may be utilized to generate a stored representation of a call graph for a given request. In various embodiments, such request identifiers may be stored in log files associated with various services. For instance, a service may store identifiers for inbound requests in an inbound request log and/or store identifiers for outbound requests in an outbound request log. In various embodiments, call graph generation logic may generate a representation of a call graph from identifiers retrieved from such logs. Such representations may be utilized for diagnosing errors with request handling, providing developer support, and performing traffic analysis.

FIG. 10 illustrates an example format for a request identifier 2100 of various embodiments. As described in more detail below, request identifiers of the illustrated format may be passed along with service requests. For instance, a service that calls another service may embed in the call an identifier formatted according to the format illustrated by FIG. 10. For example, a requesting service may embed a request identifier within metadata of a request. In various embodiments, embedding a request identifier in a service request may include embedding, within the service request, information that specifies where the request identifier is located (e.g., a pointer or memory address of a location in memory where the request identifier is stored). The various components of the illustrated request identifier format are described in more detail below.

An origin identifier (ID) 2110 may be an identifier assigned to all requests of a given call graph, which includes the initial root request as well as subsequent requests spawned as a result of the initial root request. For example, as described above, the service-oriented systems of various embodiments may be configured to process requests from various internal or external systems, such as client computer systems or computer systems consuming networked-based services. To fulfill one of such requests, the service-oriented system may call multiple different services. For instance, service “A” may be the initial service called to fulfill a request (e.g., service “A” may be called by an external system). To fulfill the initial request, service “A” may call service “B,” which may call service “C,” and so on. Each of such services may perform a particular function or quantum of work in order to fulfill the initial request. In various embodiments, each of such services may be configured to embed the same origin identifier 2110 into a request of (or call to) another service. Accordingly, each of such requests may be associated with each other by virtue of containing the same origin identifier. As described in more detail below, the call graph generation logic of various embodiments may be configured to determine that request identifiers having the same origin identifier are members of the same call graph.
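
A request identifier carrying an origin identifier, a transaction depth, and a request stack might be modeled as in the following sketch. The field types, the use of a UUID hex string, and the helper for root requests are illustrative assumptions rather than the format of FIG. 10 itself.

```python
import uuid
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class RequestIdentifier:
    """Sketch of the request identifier fields described herein (assumed layout)."""
    origin_id: str                       # shared by every request in one call graph
    transaction_depth: int               # 0 for the root request
    request_stack: Tuple[str, ...]       # recent interaction identifiers, newest last

def new_root_identifier():
    """Create the identifier for a root request entering the system."""
    return RequestIdentifier(
        origin_id=uuid.uuid4().hex,      # one origin ID per root request
        transaction_depth=0,
        request_stack=(),
    )
```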

The manner in which the origin identifier may be represented may vary according to various embodiments and implementations. One particular example of an origin identifier may include a hexadecimal string representation of a standard Universally Unique Identifier (UUID) as defined in Request for Comments (RFC) 4122 published by the Internet Engineering Task Force (IETF). In one particular embodiment, the origin identifier may contain only lower-case alphabetic characters in order to enable fast case-sensitive comparison of request identifiers (e.g., a comparison performed by the call graph generation logic described below). Note that these particular examples are not intended to limit the implementation of the origin ID. In various embodiments, the origin ID may be generated according to other formats.

Transaction depth 2120 may indicate the depth of a current request within the call graph. For instance (as described above), service “A” may be the initial service called to fulfill a root request (e.g., service “A” may be called by an external system). To fulfill the initial request, service “A” may call service “B,” which may call service “C,” and so on. In various embodiments, the depth of the initial request may be set to 0. For instance, when the first service or “root” service receives the root service request, the root service (e.g., service “A”) may set the transaction depth 2120 to 0. If in response to this request the originating service calls one or more other services, the transaction depth for these requests may be incremented by 1. For instance, if service “A” were to call two other services “B1” and “B2,” the transaction depth of the request identifiers passed to such services would be equivalent to 1. The transaction depth for request identifiers of corresponding requests sent by B1 and B2 would be incremented to 2 and so on. In the context of a call graph, the transaction depth of a particular request may in various embodiments represent the distance (e.g., number of requests) between that request and the root request. For example, the depth of the root request may be 0, the depth of a request stemming from the root request may be 1, and so on. Note that in various embodiments, such a numbering system may be somewhat arbitrary and open to modification.
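
Building on the RequestIdentifier sketch above, depth propagation for child calls might look like the following; the helper name is hypothetical, and handling of the request stack is shown in a later sketch.

```python
def child_identifier(parent: RequestIdentifier) -> RequestIdentifier:
    """Identifier for a request spawned by `parent`: same origin ID,
    transaction depth incremented by one per hop."""
    return RequestIdentifier(
        origin_id=parent.origin_id,                      # unchanged across the call graph
        transaction_depth=parent.transaction_depth + 1,
        request_stack=parent.request_stack,              # stack handling shown separately
    )

root = new_root_identifier()      # depth 0 at the root service ("A")
b1 = child_identifier(root)       # depth 1 for the call to "B1"
b2 = child_identifier(root)       # depth 1 for the call to "B2"
c = child_identifier(b1)          # depth 2 for a request sent by B1
```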

The manner in which the transaction depth may be represented may vary according to various embodiments and implementations. In one particular example, a transaction depth may be represented as a variable-width base-64 number. In various embodiments, the value of a given transaction depth may be but need not be a value equivalent to the increment of the previous transaction depth. For instance, in some embodiments, each transaction depth may be assigned a unique identifier, which may be included in the request identifier instead of the illustrated transaction depth 2120.

Interaction identifiers 2130 a-2130 n, collectively referred to as interaction identifier(s) 2130, may each identify a single request (or service call) for a given call graph. For instance (as described above), service “A” may be the initial service called to fulfill a request (e.g., service “A” may be called by an external system). To fulfill the root request, service “A” may call service “B,” which may call service “C,” and so on. In one example, the call of service “B” by service “A” may be identified by interaction identifier 2130 a, the call of service “C” by service “B” may be identified by interaction identifier 2130 b, and so on.

Note that in various embodiments separate service requests between the same services may have separate and unique interaction identifiers. For example, if service “A” calls service “B” three times, each of such calls may be assigned a different interaction identifier. In various embodiments, this characteristic may ensure that the associated request identifiers are also unique across service requests between the same services (since the request identifiers include the interaction identifiers).

Note that in various embodiments the interaction identifier may be but need not be globally unique (e.g., unique with respect to all other interaction identifiers). For instance, in some embodiments, a given interaction identifier for a given request need be unique only with respect to request identifiers having a particular origin identifier 2110 and/or a particular parent interaction identifier, which may be the interaction identifier of the request preceding the given request in the call graph (i.e., the interaction identifier of the request identifier of the parent service). In one example, if service “A” were to call two other services “B1” and “B2,” the request identifier of service “B1” and the request identifier of service “B2” would have separate interaction identifiers. Moreover, the parent interaction identifier of each of such interaction identifiers may be the interaction identifier of the request identifier associated with the call of service “A.” The relationship between interaction identifiers and parent interaction identifiers is described in more detail below.

In various embodiments, interaction identifiers may be generated randomly or pseudo-randomly. In some cases, the values generated for an interaction identifier may have a high probability of uniqueness within the context of parent interaction and/or a given transaction depth. In some embodiments, the size of the random numbers that need to be generated depends on the number of requests a service makes.
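A minimal sketch of such random generation (in Python, using the standard secrets module; the default byte width of one and the upper-case hexadecimal rendering are illustrative assumptions):

```python
import secrets

def generate_interaction_id(num_bytes: int = 1) -> str:
    """Generate a random interaction identifier of a fixed byte width.

    The width is a tunable assumption: a single byte keeps request
    identifiers small, while more bytes lower the probability of a
    collision among sibling requests at the same depth.
    """
    return secrets.token_hex(num_bytes).upper()

print(generate_interaction_id())   # one byte, e.g. '3B'
print(generate_interaction_id(4))  # four bytes, eight hex characters
```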

Request stack 2140 may include one or more of the interaction identifiers described above. In various embodiments, the request stack may include the interaction identifier of the request to which the request identifier belongs. In some embodiments, the request stack may also include other interaction identifiers, such as one or more parent interaction identifiers of prior requests (e.g., a “stack” or “history” of previous interaction identifiers in the call graph). In various embodiments, the request stack may have a fixed size. For instance, the request stack 2140 may store a fixed quantity of interaction identifiers including the interaction identifier of the request to which the request identifier belongs and one or more parent interaction identifiers.

In various embodiments, the utilization of a request stack having a fixed length (e.g., fixed quantity of stored interaction identifiers) may provide a mechanism to control storage and bandwidth throughout the service-oriented system. For example, the service-oriented system of various embodiments may in some cases receive numerous (e.g., thousands, millions, or some other quantity) service requests per a given time period (e.g., per day, per week, or some other time period), such as requests from network-based browsers (e.g., web browsers) on client systems or requests from computer systems consuming network-based services (e.g., web services). In some embodiments, a request identifier adhering to the format of request identifier 2100 may be generated for each of such requests and each of any subsequent child requests. Due to the sheer number of requests that may be handled by the service-oriented systems of various embodiments, even when the request stack of a single request identifier is of a relatively small size (e.g., a few bytes), the implications on storage and bandwidth of the overall system may in some cases be significant. Accordingly, various embodiments may include ensuring that each request identifier contains a request stack equal to and/or less than a fixed stack size (e.g., a fixed quantity of interaction identifiers). Similarly, various embodiments may include fixing the length of each interaction identifier stored as part of the request stack (e.g., each interaction identifier could be limited to a single byte, or some other size). By utilizing interaction identifiers of fixed size and/or a request stack of a fixed size, various embodiments may be configured to control the bandwidth and/or storage utilization of the service-oriented system described herein. For instance, in one example, historical request traffic (e.g., the number of requests handled by the service-oriented system per a given time period) may be monitored to determine an optimal request stack size and/or interaction identifier size in order to prevent exceeding the bandwidth or storage limitations of the service-oriented system.
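The following sketch (Python; the class name, field names, and the stack size of three are illustrative assumptions) shows how a request identifier with a bounded request stack might be modeled, with the oldest interaction identifier evicted once the fixed size is exceeded:

```python
from dataclasses import dataclass, field
from typing import List

MAX_STACK_SIZE = 3  # illustrative fixed quantity of interaction identifiers

@dataclass
class RequestIdentifier:
    origin_id: str
    transaction_depth: int
    request_stack: List[str] = field(default_factory=list)

    def push_interaction(self, interaction_id: str) -> None:
        """Append a new interaction identifier, evicting the oldest if needed."""
        self.request_stack.append(interaction_id)
        if len(self.request_stack) > MAX_STACK_SIZE:
            self.request_stack.pop(0)  # drop the oldest identifier

rid = RequestIdentifier("343cd324", 0)
for iid in ("6F", "1D", "3B", "2C"):
    rid.push_interaction(iid)
print(rid.request_stack)  # ['1D', '3B', '2C'] -- bounded at three entries
```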

In various embodiments, the utilization of a request stack having a fixed length (e.g., fixed quantity of stored interaction identifiers) may provide a mechanism to control one or more fault tolerance requirements of the system including but not limited to durability with respect to data loss and other errors (associated with individual services and host systems as well as the entire service-oriented system). For example, in some embodiments, the larger the size of the request stack (e.g., the more interaction identifiers included within a given request identifier), the more fault tolerant the system becomes.

In embodiments where request stack 2140 includes multiple interaction identifiers, the request stack may serve as a history of interaction identifiers. For instance, in the illustrated embodiment, interaction identifiers 2130 a-2130 n may represent a series of interaction identifiers in ascending chronological order (where interaction identifier 2130 a corresponds to the oldest service call and interaction identifier 2130 n corresponds to the most recent service call).

In addition to the illustrated elements, request identifier 2100 may in various embodiments include one or more portions of data for error detection and/or error correction. Examples of such data include but are not limited to various types of checksums.

FIG. 11 illustrates an example transaction flow for a root request and multiple child requests associated with the same root request. As illustrated, the transaction flow may begin with the receipt of a root request by service “A.” For instance, this initial request might originate from a client computer system (e.g., from a web browser) or from another computer system requesting a service to consume. To completely fulfill the request, service “A” may perform some quantum of work and/or request the services of another service, such as service “B” (see, e.g., request identifier 2220). Service “B” may call another service “C” (see, e.g., request identifier 2230) and so on as illustrated (see, e.g., request identifiers 2240-2250). As illustrated, since each request identifier 2210-2250 corresponds to a request of the same transaction, each of such request identifiers includes the same origin identifier “343CD324.” For instance, each of services A-D may embed such origin identifier within each of such request identifiers (described in more detail with respect to FIG. 12). Furthermore, in the illustrated embodiment, the request identifier corresponding to the initial service request includes a transaction depth of 0 since the request identifier is a parent request identifier, as described above. Each subsequent child request identifier includes a transaction depth equivalent to the previous request's transaction depth plus an increment value. In other embodiments, instead of incremented values, the transaction depths may be values that uniquely identify a transaction depth with respect to other depths of a given call graph; such values may but need not be increments of each other.

In the illustrated example, each request identifier 2210-2250 includes a request stack of a fixed size (e.g., three interaction identifiers). In other embodiments, larger or smaller request stacks may be utilized as long as the request stack includes at least one interaction identifier. Furthermore, in some embodiments, request stack sizes may be of uniform size across the service-oriented system (as is the case in the illustrated embodiment). However, in other embodiments, subsets of services may have different request stack sizes. For instance, a portion of the service-oriented system may utilize a particular fixed stack size for request identifiers whereas another portion of the service-oriented system may utilize another fixed stack size for request identifiers.

Referring collectively to FIG. 11 and FIG. 12, a representation of the receipt of an inbound service request (or service call) 2310 as well as the issuance of an outbound request 2320 by service 2300 is illustrated. Request identifiers 2240 and 2250 of FIG. 12 may correspond to the like-numbered elements of FIG. 11. As illustrated, service 2300 may receive an inbound service request 2310. Service 2300 may receive the inbound service request from another service within the service-oriented system, according to various embodiments. Inbound service request 2310 may include the requisite instructions or commands for invoking service 2300. In various embodiments, inbound service request 2310 may also include a request identifier 2240, which may include values for an origin identifier, transaction depth, and request stack, as described above with respect to FIG. 11. In various embodiments, request identifier 2240 may be embedded within inbound service request 2310 (e.g., as metadata). For example, according to various embodiments, the request identifier may be presented as part of metadata in a service framework, as part of a Hypertext Transfer Protocol (HTTP) header, as part of a SOAP header, as part of a Representational State Transfer (REST) protocol, as part of a remote procedure call (RPC), or as part of metadata of some other protocol, whether such protocol is presently known or developed in the future. In other embodiments, request identifier 2240 may be transmitted to service 2300 as an element separate from inbound service request 2310. In various embodiments, request identifier 2240 may be located elsewhere and inbound service request 2310 may include information (e.g., a pointer or memory address) for accessing the request identifier at that location.

In response to receiving the inbound service request, service 2300 may perform a designated function or quantum of work associated with the request, such as processing requests from client computer systems or computer systems requesting web services. In various embodiments, service 2300 may be configured to store a copy of request identifier 2240 within inbound log 2330. In some cases, service 2300 may require the services of another service in order to fulfill a particular request, as illustrated by the transmission of outbound service request 2320.

As is the case in the illustrated embodiment, service 2300 may beconfigured to send one or more outbound service requests 2320 to one ormore other services in order to fulfill the corresponding root request.Such outbound service requests may also include a request identifier2250 based at least in part on the received request identifier 2240.Request identifier 2250 may be generated by service 2300 or some othercomponent with which service 2300 is configured to coordinate. Sinceoutbound service request 2320 is caused at least in part by inboundservice request 2310 (i.e., request 2320 stems from request 2310), theoutbound service request 2320 and the inbound service request 2310 canbe considered to be constituents of the same call graph. Accordingly,service 2300 (or some other component of the service-oriented framework)may be configured to generate request identifier 2250 such that therequest identifier includes the same origin identifier as that of theinbound service request 2310. In the illustrated embodiment, such originidentifier is illustrated as “343CD324.” For instance, in oneembodiment, service 2300 may be configured to determine the value of theorigin identifier of the request identifier of the inbound servicerequest and write that same value into the request identifier of anoutbound service request. In various embodiments, service 2300 (or someother component of the service-oriented framework) may also beconfigured to generate request identifier 2250 such that the requestidentifier includes a transaction depth value that indicates thetransaction depth level is one level deeper than the transaction depthof the parent request (e.g., inbound service request 2310). Forinstance, in one embodiment, any given call graph may have variousdepths that each have their own depth identifier. In some embodiments,such depth identifiers may be sequential. Accordingly, in order togenerate request identifier 2250 such that it includes a transactiondepth value that indicates the transaction depth level is one leveldeeper than the transaction depth of the parent request (e.g., inboundservice request 2310), service 2300 may be configured to determine thevalue of the transaction depth from the parent request, sum that valuewith an increment value (e.g., 1, or some other increment value), andstore the result of such summation as the transaction depth value of therequest identifier of the outbound service request. In the illustratedembodiment, the transaction depth value of the inbound requestidentifier 2240 is 3 whereas the transaction depth value of the outboundrequest identifier 2250 is 4.
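As a non-limiting sketch of the origin and depth handling just described (Python; the helper name and dictionary keys are assumptions, and the request stack handling is sketched separately below):

```python
def derive_outbound_header(inbound: dict, increment: int = 1) -> dict:
    """Derive the origin identifier and transaction depth of an outbound
    request identifier from the inbound (parent) request identifier.

    `inbound` is a hypothetical dict holding 'origin_id' and 'depth';
    the outbound identifier carries the same origin identifier and a
    transaction depth one increment deeper than the parent's.
    """
    return {
        "origin_id": inbound["origin_id"],      # same origin for the whole call graph
        "depth": inbound["depth"] + increment,  # one level deeper than the parent
    }

inbound = {"origin_id": "343cd324", "depth": 3}
print(derive_outbound_header(inbound))  # {'origin_id': '343cd324', 'depth': 4}
```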

In some cases, transaction depth identifiers may instead haveidentifiers that are not necessarily related to each other sequentially.Accordingly, in some embodiments, service 2300 may be configured todetermine the transaction depth value from the request identifier of theparent request. From that value, service 2300 may determine the actualdepth level corresponding to the transaction depth value (e.g., via alookup table that provides a sequential listing of transaction depthlevels to corresponding transaction depth values). From that depthlevel, service 2300 may be configured to determine the next sequentialtransaction depth (e.g., via a lookup table that provides a sequentiallisting of transaction depth levels to corresponding transaction depthvalues) as well as the transaction depth value corresponding to thattransaction depth. Service 2300 may be configured to store suchtransaction depth value as the transaction depth value of the requestidentifier of the outbound service request.

Service 2300 may also be configured to generate request identifier 2250of the outbound service request such that the request identifier has arequest stack that includes an interaction identifier associated withthe outbound service request and all of the interaction identifiers ofthe request stack of request identifier 2240 except for the oldestinteraction identifier, which in many cases may also be the interactionidentifier corresponding to a request at the highest transaction depthlevel when compared to the transaction depth levels associated with theother interaction identifiers of the request stack. For example, theroot request may occur at transaction depth “0,” a subsequent requestmay occur at transaction depth “1,” another subsequent request may occurat transaction depth “2,” and so on. In some respects, the request stackmay operate in a fashion similar to that of a first in, first out (FIFO)buffer, as described in more detail below.

To generate the request stack of request identifier 2250, service 2300may be configured to determine the interaction identifiers presentwithin the request stack of request identifier 2240. Service 2300 mayalso be configured to determine the size of the request stack that is tobe included within request identifier 2250 (i.e., the quantity ofinteraction identifiers to be included within the request stack). Insome embodiments, this size may be specified by service 2300, anotherservice within the service-oriented system (e.g., the service that is toreceive request 2320), or some other component of the service-orientedsystem (e.g., a component storing a configuration file that specifiesthe size). In other embodiments, the size of the request stack may bespecified by service 2300. In one embodiment, the size of the requeststack may be dynamically determined by service 2300 (or some othercomponent of the service-oriented system). For instance, service 2300may be configured to dynamically determine the size of the request stackbased on capacity and/or utilization of system bandwidth and/or systemstorage. In one example, service 2300 may be configured to determinethat bandwidth utilization has reached a utilization threshold (e.g., athreshold set by an administrator). In response to such determination,service 2300 may be configured to utilize a smaller request stack sizein order to conserve bandwidth. In various embodiments, a similarapproach may be applied to storage utilization.

Dependent upon the size of the inbound request stack and the determined size of the outbound request stack (as described above), a number of different techniques may be utilized to generate the request stack of request identifier 2250, as described herein. In one scenario, the size of the inbound request stack may be the same as the determined size of the outbound request stack, as is the case in the illustrated embodiment. In this scenario, if the size of the outbound service request stack is to be n interaction identifiers, service 2300 may be configured to determine the (n−1) most recent interaction identifiers of the request stack of the inbound request identifier. Service 2300 may be configured to embed the (n−1) most recent interaction identifiers of the inbound request stack into the request stack of the outbound request identifier 2250 in addition to a new interaction identifier that corresponds to request 2320 issued by service 2300. In the illustrated embodiment, for each request identifier, the oldest interaction identifier is illustrated on the leftmost portion of the request stack and the newest interaction identifier is illustrated on the rightmost portion. In the illustrated embodiment, to generate the request stack of the outbound request identifier, service 2300 may be configured to take the request stack of the inbound request identifier, drop the leftmost (e.g., oldest) interaction identifier, shift all other interaction identifiers to the left by one position, insert a newly generated interaction identifier for the outbound request, and embed this newly generated request stack in the request identifier of the outbound request.
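The shift-and-append behavior described above can be sketched as follows (Python; the function name and the example identifier values are illustrative rather than taken verbatim from the figures). The same helper also covers the smaller-inbound-stack case described next, since slicing simply keeps whatever identifiers are available:

```python
from typing import List

def build_outbound_stack(inbound_stack: List[str],
                         new_interaction_id: str,
                         outbound_size: int) -> List[str]:
    """Build the outbound request stack from the inbound request stack.

    Keeps the (outbound_size - 1) most recent inbound interaction
    identifiers and appends the new interaction identifier for the
    outbound call, so the oldest identifiers fall off first.
    """
    keep = inbound_stack[-(outbound_size - 1):] if outbound_size > 1 else []
    return keep + [new_interaction_id]

# Equal inbound/outbound sizes: the oldest identifier ('6F') is dropped.
print(build_outbound_stack(["6F", "1D", "3B"], "2C", 3))  # ['1D', '3B', '2C']
# Inbound stack smaller than the outbound size: everything is kept.
print(build_outbound_stack(["1D"], "2C", 3))              # ['1D', '2C']
```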

In another scenario, the size of the request stack of the inbound service request identifier 2240 may be less than the size of the determined request stack size for the outbound service request identifier 2250. In these cases, the request stack size of the outbound service request may enable all of the interaction identifiers of the request stack of the inbound service request identifier to be included within the request stack of the outbound service request identifier. Accordingly, in various embodiments, service 2300 may be configured to embed all of the interaction identifiers in the request stack of the outbound request identifier 2250 in addition to a new interaction identifier that corresponds to request 2320 issued by service 2300.

In an additional scenario, the size of the request stack of the inbound service request identifier 2240 may be greater than the size of the determined request stack size for the outbound service request identifier 2250. For instance, if the size of the request stack for the outbound service request identifier is m interaction identifiers and the size of the request stack for the inbound request identifier is m+x interaction identifiers (where x and m are positive integers), service 2300 may be configured to determine the (m−1) most recent interaction identifiers of the request stack of the inbound request identifier. Service 2300 may also be configured to embed such (m−1) most recent interaction identifiers of the request stack of the inbound request identifier into the request stack of the outbound request identifier in addition to a new interaction identifier that corresponds to request 2320 issued by service 2300.

As described above, inbound request log 2330 may be managed by service2300 and include records of one or more inbound service requests. In oneembodiment, for each inbound service request received, service 2300 maybe configured to store that request's identifier (which may include anorigin identifier, transaction depth, and request stack, as illustrated)within the inbound request log. In various embodiments, service 2300 mayalso store within the log various metadata associated with each inboundservice request identifier. Such metadata may include but is not limitedto timestamps (e.g., a timestamp included within the request, such as atimestamp of when the request was generated, or a timestamp generatedupon receiving the request, such as a timestamp of when the request wasreceived by service 2300), the particular quantum of work performed inresponse to the request, and/or any errors encountered while processingthe request. In various embodiments, outbound request log 2340 mayinclude information similar to that of inbound request log 2330. Forexample, for each outbound request issued, service 2300 may store arecord of such request within outbound request log 2340. For instance,service 2300 may, for each outbound request, store that request'sidentifier within outbound request log 2340. As is the case with inboundrequest log 2330, service 2300 may also store within outbound requestlog 2340 various metadata associated with requests including but notlimited to metadata such as timestamps and errors encountered.
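A minimal sketch of such inbound and outbound log records (Python; the record fields and the colon-separated identifier encoding shown in the comments are illustrative assumptions, not a format defined above):

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogRecord:
    """One record of a hypothetical inbound or outbound request log."""
    request_identifier: str                 # e.g. "343cd324:3:6F,1D,3B" (illustrative encoding)
    timestamp: float = field(default_factory=time.time)
    error: Optional[str] = None             # any error encountered while processing

inbound_log: List[LogRecord] = []
outbound_log: List[LogRecord] = []

# Record the inbound request's identifier and metadata upon receipt...
inbound_log.append(LogRecord("343cd324:3:6F,1D,3B"))
# ...and the outbound request's identifier when a child request is issued.
outbound_log.append(LogRecord("343cd324:4:1D,3B,2C"))
```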

Referring collectively to FIG. 12 and FIG. 13, each service within theservice-oriented system may include a log reporting agent, such as logreporting agent 2350. Log reporting agent 2350 may in variousembodiments report the contents of inbound request log 2330 and/oroutbound request log 2340 to a log repository (e.g., a data store, suchas a database or other location in memory). One example of such arepository is illustrated log repository 2410 of FIG. 13. Variousprotocols for transmitting records from the logs of a service 2300 to alog repository may be utilized according to various embodiments. In someembodiments, the log reporting agent may periodically or aperiodicallyprovide log information to the log repository. In various embodiments,the log reporting agent may be configured to service requests for loginformation, such as a request from the log repository or some othercomponent of the service-oriented system. In some embodiments, inaddition to or as an alternative to reporting log information from logs2330 and 2340, log reporting agent 2350 may report log information tothe log repository in real-time (in some cases bypassing the storage ofinformation within the logs altogether). For instance, as a request isdetected or generated, the log reporting agent may immediately reportthe information to the log repository. In various embodiments, log datamay specify, for each request identifier, the service that generated therequest identifier and/or the service that received the requestidentifier.

As illustrated in FIG. 13, multiple services 2300 a-2300 h within the service-oriented system may be configured to transmit respective log data 2400 a-2400 h to log repository 2410. The data stored within log repository 2410 (e.g., service request identifiers and associated metadata) may be accessed by call graph generation logic 2420. Call graph generation logic may be configured to generate a data structure representing one or more call graphs, such as call graph data structures 2430. As described above, the particular services called to fulfill a root request may be represented as a call graph that specifies, for a particular service called, the service that called the particular service and any services called by the particular service. For instance, since a root request may result in a service call which may propagate into multiple other service calls throughout the service-oriented system, a call graph may in some cases include a deep and broad tree with multiple branches each representing a sequence of service calls.

FIG. 14 illustrates a visual representation of such a call graph datastructure that may be generated by call graph generation logic 2420. Invarious embodiments, a call graph data structure may include any datastructure that specifies, for a given root request, all the servicescalled to fulfill that root request. Note that while FIG. 14 and theassociated description pertain to an acyclic call graph, thisrepresentation is not inclusive of all variations possible for such acall graph. For instance, in other embodiments, a call graph may berepresented by any directed graph (including graphs that includedirected cycles) dependent on the nature of the service requests withinthe service-oriented system. Additionally, for a given one of suchservices, the call graph data structure may specify the service thatcalled the given service as well as any services called by the givenservice. The call graph data structure may additionally indicate ahierarchy level of a particular service within a call graph. Forinstance, in the illustrated embodiment, service 2500 is illustrated asa part of the first level of the hierarchy, service 2510 is illustratedas part of the second level of the hierarchy and so on.

To generate such a call graph, call graph generation logic may beconfigured to collect request identifiers (e.g., request identifiers2502, 2512, 2514, 2516, 2542 and 2544) that each include the same originidentifier. In the illustrated embodiment, “563BD725” denotes an exampleof such an origin identifier. In various embodiments, call graphgeneration logic may mine (e.g., perform a search or other dataanalysis) log data associated with various services in order to find acollection of request identifiers that correspond to the same originidentifier (and thus correspond to the same root request, e.g., rootrequest 2501).

In various embodiments, inbound and outbound request logs may bemaintained for each service. In these cases, call graph generation logic2420 may be configured to compare request identifiers in order todetermine that a given service called another service in the process offulfilling the root request. For example, in one embodiment, the callgraph generation logic may compare a request identifier from a givenservice's outbound request log to the request identifier from anotherservice's inbound request log. If a match is detected, the call graphgeneration logic may indicate that the service corresponding to thatoutbound request log called the service corresponding to that inboundrequest log. For example, call graph generation logic may discover arequest identifier equivalent to request identifier 2502 within theoutbound request log associated with service 2500. In this example, callgraph generation logic may also locate a request identifier equivalentto request identifier 2502 within the inbound log of service 2510. Inresponse to this match, call graph generation logic may indicate that anedge (representing a service call) exists between two particular nodesof the call graph (e.g., the node corresponding to service 2500 and thenode corresponding to service 2510). The above-described process may berepeated to determine the illustrated edges that correspond to requestidentifiers 2512, 2514, 2516, 2542 and 2544. In other embodiments, sincethe manner in which interaction identifiers are generated may ensurethat each interaction identifier is unique for a given depth level andorigin identifier, the call graph generation logic may instead searchfor matching interaction identifiers between request identifiers ofadjacent depth levels instead of searching for matching requestidentifiers.
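The log-matching technique described above might be sketched as follows (Python; the service names, the per-service log layout, and the identifier strings are hypothetical):

```python
from typing import Dict, List, Set, Tuple

def discover_edges(outbound_logs: Dict[str, List[str]],
                   inbound_logs: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Find caller -> callee edges by matching request identifiers.

    `outbound_logs` and `inbound_logs` map a service name to the request
    identifiers recorded in its outbound and inbound logs, respectively.
    A match between an outbound identifier of one service and an inbound
    identifier of another indicates that the former called the latter.
    """
    edges = set()
    inbound_index = {rid: svc for svc, rids in inbound_logs.items() for rid in rids}
    for caller, rids in outbound_logs.items():
        for rid in rids:
            callee = inbound_index.get(rid)
            if callee is not None and callee != caller:
                edges.add((caller, callee))
    return edges

outbound = {"service_2500": ["563bd725:1:3B"]}
inbound = {"service_2510": ["563bd725:1:3B"]}
print(discover_edges(outbound, inbound))  # {('service_2500', 'service_2510')}
```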

In other embodiments, only one type of log (e.g., either inbound or outbound) may be maintained for a given service. For example, if only outbound request logs are maintained for each of the services, then the call graph generation logic 2420 may utilize different techniques for determining an edge that represents a service call in the call graph data structure. In one example, call graph generation logic may compare two request identifiers that have adjacent depth values. For instance, in the illustrated embodiment, the call graph generation logic may be configured to compare request identifier 2502 to request identifier 2514, since such request identifiers contain the adjacent depth values of 1 and 2. In this case, the call graph generation logic may determine whether the most recent interaction identifier of request identifier 2502 (e.g., 3B) is equivalent to the 2nd most recent interaction identifier of request identifier 2514 (e.g., 3B). For request identifier 2514, the 2nd most recent interaction identifier is evaluated since the most recent interaction identifier position will be filled with a new interaction identifier inserted by the service that generated request identifier 2514 (in this case, service 2530). In the illustrated embodiment, this comparison returns a match since the values for the interaction identifiers are equivalent. In response to such match, the call graph generation logic may be configured to indicate within the data structure that an edge (representing a service call) exists between service 2500 and 2510.
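A sketch of this adjacent-depth comparison (Python; the stack contents other than the “3B” value discussed above are hypothetical):

```python
from typing import List

def calls_edge(parent_stack: List[str], child_stack: List[str]) -> bool:
    """Return True when the parent's most recent interaction identifier
    appears as the child's second most recent interaction identifier.

    With outbound-only logs, the child's newest slot holds the child's own
    new identifier, so a match one position earlier indicates that the
    parent's call produced the child request.
    """
    return len(child_stack) >= 2 and parent_stack[-1] == child_stack[-2]

# '3B' is the newest identifier of the parent request identifier and the
# second newest of the child request identifier, so an edge is indicated.
print(calls_edge(["1D", "3B"], ["3B", "2C"]))  # True
```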

In various embodiments, the call graph generation logic 2420 may be configured to generate a call graph in the presence of data loss. For instance, consider the case where the service-oriented system maintains outbound service logs and the log data for service 2510 is lost, as might be the case in the event of a failure on the host system on which service 2510 runs or in the case of a failure of log repository 2410. Since the request identifiers of various embodiments may include a request stack of multiple interaction identifiers, multiple layers of redundancy may be utilized to overcome a log data loss. In this example, since the outbound log data for service 2510 is lost, request identifiers 2512, 2514, and 2516 may not be available. Accordingly, the call graph generation logic may be configured to utilize a request identifier from a lower depth level to reconstruct the pertinent portion of the call graph. While request identifiers 2512, 2514, and 2516 may not be available due to data loss, the request identifier 2542 (and 2544) is available. Since request identifier 2542 includes a stack or “history” of interaction identifiers, that request identifier may be utilized to obtain information that would have been available if request identifier 2516 were not lost to data failure. Since request identifier 2542 has a depth level that is two levels lower than the depth level of request identifier 2502, the call graph generation logic may utilize the third most recent (not the second most recent as was the case in the previous example) interaction identifier. In this example, the third most recent interaction identifier is evaluated since that position would contain the interaction identifier generated by service 2500 in the illustrated embodiment. If the call graph generation logic determines that the most recent interaction identifier of request identifier 2502 matches the third most recent interaction identifier of request identifier 2542, the call graph generation logic may determine that service 2500 called service 2510 even if the log data for service 2510 is unavailable (e.g., due to data loss). Accordingly, the call graph generation logic may indicate that an edge (representing a service call) exists between service 2500 and service 2510 within the generated call graph data structure.
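Generalizing the comparison to identifiers that are more than one depth level apart, as in the data-loss example above, might look like the following sketch (Python; the function name and the stack values are hypothetical):

```python
from typing import List

def matches_at_offset(ancestor_stack: List[str],
                      descendant_stack: List[str],
                      levels_apart: int) -> bool:
    """Compare the ancestor's most recent interaction identifier against the
    descendant's (levels_apart + 1)-th most recent identifier, as when an
    intermediate service's log data has been lost."""
    position = levels_apart + 1
    return (len(descendant_stack) >= position
            and ancestor_stack[-1] == descendant_stack[-position])

# Two levels apart (the intermediate service's logs were lost): compare the
# ancestor's newest identifier with the descendant's third most recent one.
print(matches_at_offset(["1D", "3B"], ["3B", "7A", "2C"], levels_apart=2))  # True
```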

In addition to the request identifiers described above, metadatarelating to service interactions may be collected (e.g., by the logreporting agent 2350) and used in the generation of call graphs. Invarious embodiments, the metadata includes, but is not limited to, anyof the following: a timestamp, an indication of whether the interactionis on the client side or server side, the name or other identifier ofthe application programming interface (API) invoked for the interaction,the host name, data that describes the environment (e.g., a versionnumber of a production environment or test environment), and/or anyother metadata that is suitable for building the call graphs and/orcomparing one set of call graphs to another. The collected metadata maybe used to determine a graph of service interactions, i.e., byidentifying or distinguishing nodes and edges from other nodes andedges. If the metadata includes information identifying a test runand/or the version of an environment, then the metadata may enablereporting of test results (e.g., test coverage metrics and/or reports)by test run and/or environment.

In some embodiments, various metadata may also be included within suchcall graph data structure, such as timestamps, the particular quantum ofwork performed in response to a given request, and/or any errorsencountered while processing a given request. For example, theillustrated services may record timestamps of when a request isreceived, when a request is generated, and/or when a request is sent toanother service. These timestamps may be appended to the call graph datastructure to designate latency times between services (e.g., bycalculating the time difference between when a request is sent and whenit is received). In other cases, metadata may include error informationthat indicates any errors encountered or any tasks performed whileprocessing a given request. In some embodiments, such metadata mayinclude host address (e.g., an Internet Protocol address of a host) inorder to generate a graph structure that indicates which host machinesare processing requests (note that in some embodiments host machines mayhost multiple different services).

The system and method for tracking service requests described herein maybe configured to perform a variety of methods. The call graph generationlogic described herein may be configured to receive multiple requestidentifiers, each associated with a respective one of multiple servicerequests. Each given request identifier may include an origin identifierassociated with a root request, a depth value specifying a location ofthe associated service request within a sequence of service requests,and a request stack including one or more interaction identifiersassigned to a service request issued from one service to anotherservice. For example, receiving multiple request identifiers may in somecases include receiving log data that includes such request identifiers.For instance, the call graph generation logic may receive log datadirectly from host systems that host the services of theservice-oriented system described herein. In some cases, the call graphgeneration logic may receive log data from one or more log repositoriessuch as log repository 2410 described above. In general, the call graphgeneration logic may utilize any of the techniques for obtaining requestidentifiers described above with respect to call graph generation logic2420.

The call graph generation logic may further, based on multiple ones ofthe request identifiers that each include an origin identifierassociated with a particular root request, generate a data structurethat specifies a hierarchy of services called to fulfill that particularroot request; wherein, based on one or more of the interactionidentifiers and one or more of the depth values, the generated datastructure specifies, for a given service of said hierarchy: a parentservice that called the given service, and one or more child servicescalled by the given service. For example, in various embodiments,generating the data structure may include determining that each of asubset of the multiple request identifiers includes the same originidentifier as well as indicating each associated service request as anode of the hierarchy within the data structure. Examples of such nodesare illustrated in FIG. 14 as services 2500, 2510, 2520, 2530, 2540,2550 and 2560. Generating such data structure may also include, for eachnode within the hierarchy, assigning the node to a level within thehierarchy based on the transaction depth value of the request identifierassociated with the service request corresponding to that node. Examplesof such depth level values are described above with respect totransaction depth 2120 of FIG. 10. Generating the data structure mayalso include determining that the request stack of a given node at agiven level within the hierarchy includes an interaction identifier thatis the same as an interaction identifier of the request stack of anothernode located within an adjacent level of the hierarchy. For instance,the call graph generation logic may include any of the variousinteraction identifier comparison techniques described above. Inresponse to determining such match, the call graph generation logic mayindicate a service call as an edge between said given node and saidother node. Examples of such an edge are illustrated as the edgescoupling the nodes of FIG. 14 described above.
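One possible sketch of assembling such a hierarchy from request identifiers sharing an origin identifier (Python; the flattened record layout, service names, and interaction identifier values are illustrative assumptions, and the matching rule follows the adjacent-depth comparison described above):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical flattened record: (origin_id, depth, request_stack, service).
Record = Tuple[str, int, Tuple[str, ...], str]

def build_call_graph(records: List[Record], origin_id: str) -> Dict[str, List[str]]:
    """Build a parent -> children adjacency mapping for one root request.

    Records sharing the target origin identifier are grouped by depth; a
    node at depth d+1 is attached to the node at depth d whose newest
    interaction identifier matches the child's second newest identifier.
    """
    by_depth = defaultdict(list)
    for origin, depth, stack, service in records:
        if origin == origin_id:
            by_depth[depth].append((stack, service))

    graph: Dict[str, List[str]] = defaultdict(list)
    for depth in sorted(by_depth):
        for child_stack, child_svc in by_depth.get(depth + 1, []):
            for parent_stack, parent_svc in by_depth[depth]:
                if len(child_stack) >= 2 and parent_stack[-1] == child_stack[-2]:
                    graph[parent_svc].append(child_svc)
    return dict(graph)

records = [
    ("563bd725", 0, ("5E",), "service_2500"),
    ("563bd725", 1, ("5E", "3B"), "service_2510"),
    ("563bd725", 2, ("5E", "3B", "2C"), "service_2530"),
]
print(build_call_graph(records, "563bd725"))
# {'service_2500': ['service_2510'], 'service_2510': ['service_2530']}
```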

In various embodiments, the techniques for analyzing request identifiersand generating a call graph may be performed on an incremental basis.For example, as request identifiers are updated (e.g., as logs and/orlog repositories receive new data), the call graph generation logicdescribed herein may be configured to incrementally update the generatedcall graph data structure to reflect the newly reported requests. Insome embodiments, the techniques described herein may be performed on adepth-level basis. For example, as request identifiers are received(e.g., by the log repository or call graph generation logic describedherein), each identifier may be categorized (e.g., placed in acategorized directory) based on transaction depth.

In various embodiments, the generated call graph data structuresdescribed herein may be utilized for diagnostic purposes. For instance,as described above, the call graph data structure may include metadata,such as a record of error(s) that occur when processing a request.Because this metadata may be associated with specific nodes and/orservice calls, various embodiments may include determining sources oferrors or faults within the service-oriented system. In someembodiments, the generated call graph data structures described hereinmay be utilized for analytical purposes. For example, based on callgraph data structures generated as described herein, various embodimentsmay include determining historical paths of service calls and/or pathanomalies. For instance, various embodiments may include detecting that,for a given root request, one or more services are being calledunnecessarily. For instance, such services may not be needed to fulfillthe particular root request. Accordingly, in some embodiments, suchservices may be culled from processing further requests similar to orthe same as the root request that originally initiated the unnecessaryservice calls (e.g., a re-orchestration process may be employed tomodify the particular services called for a particular type of request).By removing such unnecessary service calls, various embodiments mayconserve resources such as storage and/or bandwidth. In otherembodiments, the generated call graph data structures described hereinmay be utilized for auditing purposes. For example, in the case that theservice oriented system provides network-based services (e.g., webservices) to consumers of such services (who may provide remunerationfor the consumption of services), such consumers may desire to at leastoccasionally view information that confirms they are being charged in afair manner. To provide such information to the consumer, variousembodiments may include providing the consumer with various records suchas records that indicate how frequent they consume network-basedservices and in what quantity. Such information may be generated basedon the call graph data structures described herein.

In one embodiment, the call graph generation logic may receive a firstrequest identifier associated with an inbound service request. Therequest identifier may include an origin identifier associated with aroot request, a depth value specifying a location of the inbound servicerequest within a sequence of service requests, and a request stackincluding multiple interaction identifiers each assigned to a respectiveservice request issued from one service to another service of multipleservices. One example of receiving such a request identifier isillustrated in FIG. 12 as the receipt of inbound service requestidentifier 2240 by service 2300.

The call graph generation logic may also generate a new request stack.The new request stack may include all of the interaction identifiers ofthe first request identifier except for an oldest one of the interactionidentifiers. For instance, as illustrated in FIG. 12, the request stackof outbound request identifier 2250 does not include “6F,” which is theoldest interaction identifier of the inbound service request identifier2240. The new request stack may also include a new interactionidentifier associated with an outbound service request. For instance, asillustrated in FIG. 12, the request stack of outbound service requestidentifier 2250 includes a new interaction identifier “2C.”

The call graph generation logic may also generate a second request identifier associated with the outbound service request. The second request identifier may include the origin identifier, a new depth value specifying a location of the outbound service request within the sequence of service requests, and the new request stack. One example of such a second request identifier is illustrated as outbound service request identifier 2250 of FIG. 12.

In various embodiments, the call graph generation logic may alsogenerate the new depth value such that the new depth value is a resultof incrementing the first depth value. For example, in the illustratedembodiment of FIG. 12, the depth value of the outbound requestidentifier (i.e., “4”) may be the result of incrementing the depth valueof the inbound request identifier (i.e., “3”). In various embodiments,the call graph generation logic may store either of (or both of) thefirst request identifier and the second request identifier as log dataaccessible to one or more computer systems. For instance, in theillustrated embodiment of FIG. 12, the inbound and outbound requestidentifiers may be stored in inbound request log 2330 and outboundrequest log 2340, respectively.

For each of the interactions between the services 2500, 2510, 2520, 2530, 2540, 2550, and 2560, a request path or downstream path is shown. For each of the interactions between the services 2500, 2510, 2520, 2530, 2540, 2550, and 2560, a reply path or upstream path is also shown. In response to each request, the recipient (i.e., downstream) service may send a reply to the requesting (i.e., upstream) service at any appropriate point in time, e.g., after completing the requested operation and receiving replies for any further downstream services called to satisfy the request. A downstream service that is a leaf in the relevant call graph (i.e., a service that calls no further services) may send a reply to the immediately upstream service upon completion of the requested operation or upon encountering an error that prevents completion of the requested operation. A reply may include any suitable data and/or metadata, such as the output of a requested service in the reply path and/or any error codes or condition codes experienced in the reply path. A reply may also include any suitable element(s) of identifying information from the request stack of the corresponding request, such as the origin identifier and/or interaction identifiers shown in FIG. 10.

One example system configuration for tracking service requests is illustrated in FIG. 15. As illustrated, the various components of the example system are coupled together via a network 2180. Network 2180 may include any combination of local area networks (LANs), wide area networks (WANs), some other network configured to communicate data to/from computer systems, or some combination thereof. Each of host systems 2700 a-c and 2720 may be implemented by a computer system, such as computer system 3000 described below. Call graph generation logic 2420 may be implemented as software (e.g., program instructions executable by a processor of host system 2720), hardware, or some combination thereof. Call graph data structures 2430 may be generated by call graph generation logic 2420 and stored in a memory of host system 2720. Log repository 2410 may be implemented as a data store (e.g., database, memory, or some other element configured to store data) coupled to network 2180. In other embodiments, log repository 2410 may be implemented as a backend system of host system 2720 and accessible to host system 2720 via a separate network. Host system 2700 a may be configured to execute program instructions to implement one or more services 2750 a. Such services may include but are not limited to one or more of network-based services (e.g., a web service), applications, functions, objects, methods (e.g., object-oriented methods), subroutines, or any other set of computer-executable instructions. Examples of services 2750 include any of the services described above. Host systems 2700 b-c and services 2750 b-c may be configured in a similar manner.

In various embodiments, the various services of the illustratedembodiment may be controlled by a common entity. However, in someembodiments, external systems, such as a system controlled by anotherentity, may be called as part of a sequence of requests for fulfilling aroot request. In some cases, the external system may adhere to therequest identifier generation techniques described herein and mayintegrate with the various services described above. In the event thatan external system does not adhere to the various techniques forgenerating request identifiers as described herein, the external systemmay be treated as a service that is not visible in the call graph or,alternatively, requests sent back from the external system may betreated as new requests altogether (e.g., as root requests). In variousembodiments, the system configuration may include one or more proxysystems and/or load balancing systems. In some cases, the systemconfiguration may treat these systems as transparent from a requestidentifier generation perspective. In other cases, these systems maygenerate request identifiers according to the techniques describedabove.

In some embodiments, the service-oriented system described herein may be integrated with other external systems that may utilize different techniques for identifying requests. For instance, the request identifiers described herein may in various embodiments be wrapped or enveloped in additional data (e.g., additional identifiers, headers, etc.) to facilitate compatibility with various external systems.

Illustrative Computer System

In at least some embodiments, a computer system that implements aportion or all of one or more of the technologies described herein mayinclude a general-purpose computer system that includes or is configuredto access one or more computer-readable media. FIG. 16 illustrates sucha general-purpose computing device 3000. In the illustrated embodiment,computing device 3000 includes one or more processors 3010 coupled to asystem memory 3020 via an input/output (I/O) interface 3030. Computingdevice 3000 further includes a network interface 3040 coupled to I/Ointerface 3030.

In various embodiments, computing device 3000 may be a uniprocessorsystem including one processor 3010 or a multiprocessor system includingseveral processors 3010 (e.g., two, four, eight, or another suitablenumber). Processors 3010 may include any suitable processors capable ofexecuting instructions. For example, in various embodiments, processors3010 may be general-purpose or embedded processors implementing any of avariety of instruction set architectures (ISAs), such as the x86,PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. Inmultiprocessor systems, each of processors 3010 may commonly, but notnecessarily, implement the same ISA.

System memory 3020 may be configured to store program instructions anddata accessible by processor(s) 3010. In various embodiments, systemmemory 3020 may be implemented using any suitable memory technology,such as static random access memory (SRAM), synchronous dynamic RAM(SDRAM), nonvolatile/Flash-type memory, or any other type of memory. Inthe illustrated embodiment, program instructions and data implementingone or more desired functions, such as those methods, techniques, anddata described above, are shown stored within system memory 3020 as code(i.e., program instructions) 3025 and data 3026.

In one embodiment, I/O interface 3030 may be configured to coordinateI/O traffic between processor 3010, system memory 3020, and anyperipheral devices in the device, including network interface 3040 orother peripheral interfaces. In some embodiments, I/O interface 3030 mayperform any necessary protocol, timing or other data transformations toconvert data signals from one component (e.g., system memory 3020) intoa format suitable for use by another component (e.g., processor 3010).In some embodiments, I/O interface 3030 may include support for devicesattached through various types of peripheral buses, such as a variant ofthe Peripheral Component Interconnect (PCI) bus standard or theUniversal Serial Bus (USB) standard, for example. In some embodiments,the function of I/O interface 3030 may be split into two or moreseparate components, such as a north bridge and a south bridge, forexample. Also, in some embodiments some or all of the functionality ofI/O interface 3030, such as an interface to system memory 3020, may beincorporated directly into processor 3010.

Network interface 3040 may be configured to allow data to be exchangedbetween computing device 3000 and other devices 3060 attached to anetwork or networks 3050. In various embodiments, network interface 3040may support communication via any suitable wired or wireless generaldata networks, such as types of Ethernet network, for example.Additionally, network interface 3040 may support communication viatelecommunications/telephony networks such as analog voice networks ordigital fiber communications networks, via storage area networks such asFibre Channel SANs, or via any other suitable type of network and/orprotocol.

In some embodiments, system memory 3020 may be one embodiment of acomputer-readable (i.e., computer-accessible) medium configured to storeprogram instructions and data as described above for implementingembodiments of the corresponding methods and apparatus. However, inother embodiments, program instructions and/or data may be received,sent or stored upon different types of computer-readable media.Generally speaking, a computer-readable medium may includenon-transitory storage media or memory media such as magnetic or opticalmedia, e.g., disk or DVD/CD coupled to computing device 3000 via I/Ointerface 3030. A non-transitory computer-readable storage medium mayalso include any volatile or non-volatile media such as RAM (e.g. SDRAM,DDR SDRAM, RDRAM, SRAM, etc.), ROM, etc, that may be included in someembodiments of computing device 3000 as system memory 3020 or anothertype of memory. Further, a computer-readable medium may includetransmission media or signals such as electrical, electromagnetic, ordigital signals, conveyed via a communication medium such as a networkand/or a wireless link, such as may be implemented via network interface3040. Portions or all of multiple computing devices such as thatillustrated in FIG. 16 may be used to implement the describedfunctionality in various embodiments; for example, software componentsrunning on a variety of different devices and servers may collaborate toprovide the functionality. In some embodiments, portions of thedescribed functionality may be implemented using storage devices,network devices, or special-purpose computer systems, in addition to orinstead of being implemented using general-purpose computer systems. Theterm “computing device,” as used herein, refers to at least all thesetypes of devices, and is not limited to these types of devices.

Various embodiments may further include receiving, sending, or storinginstructions and/or data implemented in accordance with the foregoingdescription upon a computer-readable medium. Generally speaking, acomputer-readable medium may include storage media or memory media suchas magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile ornon-volatile media such as RAM (e.g. SDRAM, DDR, RDRAM, SRAM, etc.),ROM, etc. In some embodiments, a computer-readable medium may alsoinclude transmission media or signals such as electrical,electromagnetic, or digital signals, conveyed via a communication mediumsuch as network and/or a wireless link.

The various methods as illustrated in the Figures and described hereinrepresent exemplary embodiments of methods. The methods may beimplemented in software, hardware, or a combination thereof. In variousof the methods, the order of the steps may be changed, and variouselements may be added, reordered, combined, omitted, modified, etc.Various ones of the steps may be performed automatically (e.g., withoutbeing directly prompted by user input) and/or programmatically (e.g.,according to program instructions).

The terminology used in the description of the invention herein is forthe purpose of describing particular embodiments only and is notintended to be limiting of the invention. As used in the description ofthe invention and the appended claims, the singular forms “a”, “an” and“the” are intended to include the plural forms as well, unless thecontext clearly indicates otherwise. It will also be understood that theterm “and/or” as used herein refers to and encompasses any and allpossible combinations of one or more of the associated listed items. Itwill be further understood that the terms “includes,” “including,”“comprises,” and/or “comprising,” when used in this specification,specify the presence of stated features, integers, steps, operations,elements, and/or components, but do not preclude the presence oraddition of one or more other features, integers, steps, operations,elements, components, and/or groups thereof.

As used herein, the term “if” may be construed to mean “when” or “upon”or “in response to determining” or “in response to detecting,” dependingon the context. Similarly, the phrase “if it is determined” or “if [astated condition or event] is detected” may be construed to mean “upondetermining” or “in response to determining” or “upon detecting [thestated condition or event]” or “in response to detecting [the statedcondition or event],” depending on the context.

It will also be understood that, although the terms first, second, etc.,may be used herein to describe various elements, these elements shouldnot be limited by these terms. These terms are only used to distinguishone element from another. For example, a first contact could be termed asecond contact, and, similarly, a second contact could be termed a firstcontact, without departing from the scope of the present invention. Thefirst contact and the second contact are both contacts, but they are notthe same contact.

Numerous specific details are set forth herein to provide a thoroughunderstanding of claimed subject matter. However, it will be understoodby those skilled in the art that claimed subject matter may be practicedwithout these specific details. In other instances, methods, apparatus,or systems that would be known by one of ordinary skill have not beendescribed in detail so as not to obscure claimed subject matter. Variousmodifications and changes may be made as would be obvious to a personskilled in the art having the benefit of this disclosure. It is intendedto embrace all such modifications and changes and, accordingly, theabove description is to be regarded in an illustrative rather than arestrictive sense.

What is claimed is:
1. A system, comprising:
    one or more computing devices configured to implement an interaction monitoring system, wherein the interaction monitoring system is configured to:
    determine a change in a sampling rate for an entity, wherein the sampling rate is for sampling interactions between the entity and one or more additional entities to collect trace information, wherein determination of the change in the sampling rate is performed after a collection of the trace information has been initiated at the entity and comprises:
        monitor information external to the entity, wherein the information external to the entity comprises: network traffic metrics determined based at least in part on a respective volume of traffic at various ones of the one or more additional entities in a system comprising the entity and the one or more additional entities, a size of a fleet of hosts for the entity and the one or more additional entities, an indication of an anticipated higher or lower network traffic volume in the system comprising the entity and the one or more additional entities, or a determination, made external to the entity, of redundancy in the trace information; and
        determine the change in the sampling rate for the entity based at least in part on the monitored information external to the entity; and
    initiate, at the entity, a collection of additional trace information, wherein the collection of the additional trace information is initiated according to the change in the sampling rate for the entity.
2. The system as recited in claim 1, wherein the interaction monitoring system is further configured to: monitor a collective performance of the entity and the one or more additional entities after at least a portion of the trace information is collected at the entity according to the sampling rate; and determine the information external to the service based on the collective performance of the entity and the one or more additional entities.
3. The system as recited in claim 1, wherein the interaction monitoring system is further configured to: monitor a performance of individual ones of the additional entities after at least a portion of the trace information is collected at the entity according to an initial sampling rate; and determine the information external to the entity from which to determine the change in the sampling rate based on the performance of the individual ones of the additional entities.
4. The system as recited in claim 1, wherein the change in the sampling rate is determined to be an increase in the sampling rate based at least in part on one or more of the network traffic metrics in the system and an indication that the trace information collected is insufficient for an analysis being conducted or to be conducted.
5. The system as recited in claim 1, wherein the change in the sampling rate is determined to be a reduction in the sampling rate based at least in part on one or more of the network traffic metrics in the system and an indication that the trace information collected is excessive for an analysis being conducted or to be conducted.
6. The system as recited in claim 1, wherein the information external to the entity indicates an anticipated higher network traffic volume in the system and the change in the sampling rate is a reduction in the sampling rate.
7. The system as recited in claim 1, wherein the information external to the entity indicates an anticipated lower network traffic volume in the system and the change in the sampling rate is an increase in the sampling rate.
8. A computer-implemented method, comprising:
    determining a change in a sampling rate for an entity, wherein the sampling rate is for sampling interactions between the entity and one or more additional entities to collect trace information, wherein determining the change in the sampling rate is performed after a collection of the trace information has been initiated at the entity and comprises:
        monitoring information external to the entity, wherein the information external to the entity comprises: network traffic metrics determined based at least in part on a respective volume of traffic at various ones of the one or more additional entities in a system comprising the entity and the one or more additional entities, a size of a fleet of hosts for the entity and the one or more additional entities, an indication of an anticipated higher or lower network traffic volume in the system comprising the entity and the one or more additional entities, or a determination, made external to the entity, of redundancy in the trace information; and
        determining the change in the sampling rate for the entity based at least in part on the monitored information external to the entity; and
    initiating, at the entity, a collection of additional trace information, wherein the collection of the additional trace information is initiated according to the change in the sampling rate for the entity.
9. The method as recited in claim 8, further comprising: monitoring a collective performance of the entity and the one or more additional entities after the collection of the trace information is initiated at the entity; and determining the information external to the entity based on the collective performance of the entity and the one or more additional entities.
10. The method as recited in claim 8, further comprising: monitoring a performance of individual ones of the additional entities after the collection of the trace information is initiated at the entity; and determining the information external to the entity based on the performance of the individual ones of the additional entities.
11. The method as recited in claim 8, further comprising determining the change in the sampling rate to be an increase in the sampling rate based at least in part on one or more of the network traffic metrics in the system and an indication that the trace information collected is insufficient for an analysis being conducted or to be conducted.
12. The method as recited in claim 8, further comprising determining the change in the sampling rate to be a reduction in the sampling rate based at least in part on one or more of the network traffic metrics in the system and an indication that the trace information collected is excessive for an analysis being conducted or to be conducted.
13. The method as recited in claim 8, wherein the external information indicates an anticipated higher network traffic volume in the system and the change in the sampling rate is a reduction in the sampling rate.
14. The method as recited in claim 8, wherein the external information indicates an anticipated lower network traffic volume in the system and the change in the sampling rate is an increase in the sampling rate.
15. A computer-readable storage medium storing program instructions computer-executable to perform:
    determining a change in a sampling rate for an entity, wherein the sampling rate is for sampling interactions between the entity and one or more additional entities to collect trace information, wherein determining the change in the sampling rate is performed after a collection of the trace information has been initiated at the entity and comprises:
        monitoring information external to the entity, wherein the information external to the entity comprises: network traffic metrics determined based at least in part on a respective volume of traffic at various ones of the one or more additional entities in a system comprising the entity and the one or more additional entities, a size of a fleet of hosts for the entity and the one or more additional entities, an indication of an anticipated higher or lower network traffic volume in the system comprising the entity and the one or more additional entities, or a determination, made external to the entity, of redundancy in the trace information; and
        determining the change in the sampling rate for the entity based at least in part on the monitored information external to the entity; and
    initiating, at the entity, a collection of additional trace information, wherein the collection of the additional trace information is initiated according to the change in the sampling rate for the entity.
16. The computer-readable storage medium as recited in claim 15, wherein the program instructions are computer-executable to perform: monitoring a performance of the plurality of additional entities; and determining the information external to the entity based at least in part on the performance of the plurality of additional entities.
17. The computer-readable storage medium as recited in claim 15, wherein the program instructions are computer-executable to perform: determining the change in the sampling rate to be an increase in the sampling rate based at least in part on one or more of the network traffic metrics in the system and an indication that the trace information collected is insufficient for an analysis being conducted or to be conducted.
18. The computer-readable storage medium as recited in claim 15, wherein the program instructions are computer-executable to perform: determining the change in the sampling rate to be a reduction in the sampling rate based at least in part on one or more of the network traffic metrics in the system and an indication that the trace information collected is excessive for an analysis being conducted or to be conducted.
19. The computer-readable storage medium as recited in claim 15, wherein the external information indicates an anticipated higher network traffic volume in the system and the change in the sampling rate is a reduction in the sampling rate.
20. The computer-readable storage medium as recited in claim 15, wherein the external information indicates an anticipated lower network traffic volume in the system and the change in the sampling rate is an increase in the sampling rate.
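
By way of illustration only, the following sketch (in Python) suggests one hypothetical way an interaction monitoring system might derive a changed sampling rate from information external to a traced entity, such as network traffic metrics, fleet size, an anticipated higher or lower traffic volume, and an assessment of redundancy or insufficiency in the collected traces. The ExternalSignals fields, thresholds, and multiplicative adjustments in compute_sampling_rate are illustrative assumptions rather than logic taken from the claims.

# Hypothetical sketch of the decision logic an interaction monitoring system
# might use to pick a new sampling rate from information external to the
# traced entity. Signal names and thresholds are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ExternalSignals:
    traffic_volume_ratio: float       # observed volume relative to a baseline (1.0 = baseline)
    fleet_size: int                   # hosts serving the entity and additional entities
    anticipates_higher_traffic: bool  # e.g., a scheduled high-traffic event
    trace_redundancy: float           # fraction of traces judged redundant, 0.0-1.0
    traces_insufficient: bool         # analysis needs more data than was collected


def compute_sampling_rate(current_rate: float, signals: ExternalSignals) -> float:
    """Return an adjusted sampling rate bounded to [0.001, 1.0]."""
    rate = current_rate

    # Anticipated or observed higher traffic: sampled interactions are costlier
    # in aggregate, so back off the rate.
    if signals.anticipates_higher_traffic or signals.traffic_volume_ratio > 1.5:
        rate *= 0.5

    # Anticipated or observed lower traffic leaves headroom to sample more.
    if not signals.anticipates_higher_traffic and signals.traffic_volume_ratio < 0.5:
        rate *= 2.0

    # Largely redundant traces suggest the rate exceeds what the analysis needs.
    if signals.trace_redundancy > 0.8:
        rate *= 0.5

    # If the analysis is starved of data, raise the rate.
    if signals.traces_insufficient:
        rate *= 2.0

    # A large fleet multiplies the absolute volume of trace data, so scale the
    # per-host rate down as the fleet grows (illustrative heuristic).
    if signals.fleet_size > 1000:
        rate *= 0.5

    return min(max(rate, 0.001), 1.0)


if __name__ == "__main__":
    signals = ExternalSignals(
        traffic_volume_ratio=1.8,
        fleet_size=2500,
        anticipates_higher_traffic=True,
        trace_redundancy=0.9,
        traces_insufficient=False,
    )
    print(compute_sampling_rate(0.10, signals))  # prints 0.0125 with these inputs

A multiplicative, bounded adjustment is used here only to keep the example small; the claims do not prescribe any particular decision function.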